Network policy in dual-stack cluster stops allowing ingress after pod restart #10053

jfmontanaro · 2024-05-01T02:04:19Z

Environmental Info:
K3s Version: v1.29.4+k3s1 (94e29e2)
go version go1.21.9

Node(s) CPU architecture, OS, and Version:

Linux kube-test 5.15.0-102-generic #112-Ubuntu SMP Tue Mar 5 16:50:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

single-node cluster for testing

Describe the bug:

When running a cluster in dual-stack mode, network policies that allow specific ingress to targeted pods stop allowing that ingress after the pod has been destroyed/re-created.

Steps To Reproduce:

Install K3s in dual-stack mode.
Bring up a pod running traefik/whoami:

kind: Pod
apiVersion: v1
metadata:
  name: server
  namespace: default
  labels:
    role: server
spec:
  containers:
    - name: whoami
      image: traefik/whoami

Bring up another pod that you can exec into to test connectivity:

kind: Pod
apiVersion: v1
metadata:
  name: client
  namespace: default
spec:
  containers:
    - name: alpine
      image: alpine:latest
      command: [/bin/sh]
      args: [-c, 'sleep infinity']

Test connectivity (I'm just using bare pod IPs right now, but I've tested with services and the result is the same) - should get a response from the server:

$ kubectl exec -it pod/client -- sh
/ # wget -O - http://10.42.0.14
Connecting to 10.42.0.14 (10.42.0.14:80)
writing to stdout
Hostname: server
IP: 127.0.0.1
...

Apply network policy limiting access to port 80/tcp:

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: example
spec:
  podSelector:
    matchLabels:
      role: server
  policyTypes:
  - Ingress
  ingress:
  - ports:
    - port: 80
      protocol: TCP

Test connectivity again, it should still work
Delete and re-create server pod from above
Test connectivity one last time (note, IP will have changed):

/ # wget -O - http://10.42.0.16
Connecting to 10.42.0.16 (10.42.0.16:80)
wget: can't connect to remote host (10.42.0.16): Connection refused

Delete policy and test again. Server should be reachable again.

Expected behavior:

Server should be reachable from client regardless of restarts.

Actual behavior:

Server is reachable from client only until server has been restarted, after which it goes dark.

Additional context / logs:
It's interesting to me that the connection is being refused, rather than timing out. I may be misremembering, but I thought that network policies simply dropped traffic that wasn't permitted, rather than refusing it. I could be misremembering, though.

Behavior initially observed on K3s v1.28.8+k3s1, upgraded to v1.29.4+k3s1 to see if that would fix it but no dice.

The text was updated successfully, but these errors were encountered:

brandond · 2024-05-01T22:56:47Z

This works fine for me. Are you perhaps just trying to make a request very quickly, before the network policies have synced? They are not instantaneous, there may be a few seconds following the creation of a new pod where the policies are not yet in effect.

/ # while true; do wget -qO - http://10.42.0.4; sleep 1; done
Hostname: server
IP: 127.0.0.1
IP: ::1
IP: 10.42.0.4
IP: fe80::48f4:edff:fe25:9ac5
RemoteAddr: 10.42.0.5:60158
GET / HTTP/1.1
Host: 10.42.0.4
User-Agent: Wget
Connection: close

Hostname: server
IP: 127.0.0.1
IP: ::1
IP: 10.42.0.4
IP: fe80::48f4:edff:fe25:9ac5
RemoteAddr: 10.42.0.5:60172
GET / HTTP/1.1
Host: 10.42.0.4
User-Agent: Wget
Connection: close

Hostname: server
IP: 127.0.0.1
IP: ::1
IP: 10.42.0.4
IP: fe80::48f4:edff:fe25:9ac5
RemoteAddr: 10.42.0.5:60174
GET / HTTP/1.1
Host: 10.42.0.4
User-Agent: Wget
Connection: close

Hostname: server
IP: 127.0.0.1
IP: ::1
IP: 10.42.0.4
IP: fe80::48f4:edff:fe25:9ac5
RemoteAddr: 10.42.0.5:49488
GET / HTTP/1.1
Host: 10.42.0.4
User-Agent: Wget
Connection: close

Hostname: server
IP: 127.0.0.1
IP: ::1
IP: 10.42.0.4
IP: fe80::48f4:edff:fe25:9ac5
RemoteAddr: 10.42.0.5:49490
GET / HTTP/1.1
Host: 10.42.0.4
User-Agent: Wget
Connection: close
# ---
# --- here is where I deleted the server pod - note that I get only a single error and then no further output until I hit control-c to break out of the loop
# ---
wget: can't connect to remote host (10.42.0.4): Connection refused
^C
# ---
# --- I then re-created the server pod and started querying it at its new address within a few seconds
# ---
/ # while true; do wget -qO - http://10.42.0.6; sleep 1; done
Hostname: server
IP: 127.0.0.1
IP: ::1
IP: 10.42.0.6
IP: fe80::200e:d6ff:fe7c:26d1
RemoteAddr: 10.42.0.5:43828
GET / HTTP/1.1
Host: 10.42.0.6
User-Agent: Wget
Connection: close

Hostname: server
IP: 127.0.0.1
IP: ::1
IP: 10.42.0.6
IP: fe80::200e:d6ff:fe7c:26d1
RemoteAddr: 10.42.0.5:43832
GET / HTTP/1.1
Host: 10.42.0.6
User-Agent: Wget
Connection: close

Hostname: server
IP: 127.0.0.1
IP: ::1
IP: 10.42.0.6
IP: fe80::200e:d6ff:fe7c:26d1
RemoteAddr: 10.42.0.5:43834
GET / HTTP/1.1
Host: 10.42.0.6
User-Agent: Wget
Connection: close

Hostname: server
IP: 127.0.0.1
IP: ::1
IP: 10.42.0.6
IP: fe80::200e:d6ff:fe7c:26d1
RemoteAddr: 10.42.0.5:43836
GET / HTTP/1.1
Host: 10.42.0.6
User-Agent: Wget
Connection: close

Hostname: server
IP: 127.0.0.1
IP: ::1
IP: 10.42.0.6
IP: fe80::200e:d6ff:fe7c:26d1
RemoteAddr: 10.42.0.5:43848
GET / HTTP/1.1
Host: 10.42.0.6
User-Agent: Wget
Connection: close

Hostname: server
IP: 127.0.0.1
IP: ::1
IP: 10.42.0.6
IP: fe80::200e:d6ff:fe7c:26d1
RemoteAddr: 10.42.0.5:43860
GET / HTTP/1.1
Host: 10.42.0.6
User-Agent: Wget
Connection: close

jfmontanaro · 2024-05-02T01:58:42Z

That's odd. No, I've tried waiting for as long as several minutes and for me connectivity is never restored.

I've created a new repo here with the manifests I'm using to test, plus a script that puts it all together to make sure we're doing the exact same things. You should be able to replicate by just spinning up a new server, then:

git clone https://github.com/jfmontanaro/k3s-netpol-issue-demo.git
cd k3s-netpol-issue-demo
./run.sh

Note that I did add a service to front the server pod, in contrast to how I was doing it above, so that I didn't have to bother figuring out the IP addresses for direct communication with pods. In all my testing, however, it hasn't made a difference whether I was talking to a pod or a service fronting that pod.

I've been testing on VMs from Vultr, if that makes any difference. I can't think why it should but if it still doesn't replicate I don't know what else could be making the difference.

brandond · 2024-05-02T07:24:13Z

I used the exact pod manifests you provided in your initial message. Is there anything else odd about your configuration? Are you sure it's network policy that is blocking your traffic?

jfmontanaro · 2024-05-03T16:00:16Z

I think it's likely that the network policy is at least involved, given that the issue goes away immediately if I remove the network policy.

I notice that the responses you are getting from your whoami service show only link-local addresses for the server's IPv6 address; are you using a dual-stack cluster? I've only seen this issue in dual stack clusters.

brandond · 2024-05-06T20:36:47Z

Do you have any errors in the k3s logs? I wonder if something is going on with the ipv6 rule sync?

brandond · 2024-05-06T20:37:08Z

@manuelbuil do you have any ideas on what else to look at here?

ghost · 2024-05-07T09:14:33Z

I have a similar issue as well. Sometimes, when I restart the pods, the network policy does not apply for minutes. Making changes to the network policy or deleting the pod again causes it to work sometimes as well.

node-external-ip: "<ext4>,<ext6>"
flannel-iface: eno2

flannel-backend: host-gw
flannel-ipv6-masq: true
tls-san: <domain>
cluster-domain: <domain>
cluster-cidr: "<int4>/13,<int6>/56"
service-cidr: "<int4/16,<int6>/112"
secrets-encryption: true
kube-controller-manager-arg:
  - "node-cidr-mask-size-ipv4=23"
  - "node-cidr-mask-size-ipv6=64"
etcd-expose-metrics: true
disable: local-storage

The problem becomes more frequent when using Egress as well.

Unfortunately, I don't have much details yet but I'll try to do some investigation.

jfmontanaro · 2024-05-07T18:31:17Z

Do you have any errors in the k3s logs? I wonder if something is going on with the ipv6 rule sync?

Just tested this, going from a working state (pod running and receiving traffic) to a broken state (pod running but inaccessible) via delete / recreate.

After deleting the working pod, I get this output:

May 07 18:23:52 vultr k3s[4635]: I0507 18:23:52.274001    4635 topology_manager.go:215] "Topology Admit Handler" podUID="14693fc2-b378-4c6c-8764-aee596c32cd2" podNamespace="default" podName="server"
May 07 18:23:52 vultr k3s[4635]: E0507 18:23:52.284261    4635 cpu_manager.go:395] "RemoveStaleState: removing container" podUID="2cd9f56a-f17c-44bf-b886-36a419d919e1" containerName="whoami"
May 07 18:23:52 vultr k3s[4635]: I0507 18:23:52.284496    4635 memory_manager.go:354] "RemoveStaleState removing state" podUID="2cd9f56a-f17c-44bf-b886-36a419d919e1" containerName="whoami"
May 07 18:23:52 vultr k3s[4635]: I0507 18:23:52.460776    4635 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-94kdb\" (UniqueName: \"kubernetes.io/projected/14693fc2-b378-4c6c-8764-aee596c32cd2-kube-api-access-94kdb\") pod \"server\" (UID: \"14693fc2-b378-4c6c-8764-aee596c32cd2\") " pod="default/server"

When I then re-create the pod, I get this:

May 07 18:23:52 vultr k3s[4635]: I0507 18:23:52.274001    4635 topology_manager.go:215] "Topology Admit Handler" podUID="14693fc2-b378-4c6c-8764-aee596c32cd2" podNamespace="default" podName="server"
May 07 18:23:52 vultr k3s[4635]: E0507 18:23:52.284261    4635 cpu_manager.go:395] "RemoveStaleState: removing container" podUID="2cd9f56a-f17c-44bf-b886-36a419d919e1" containerName="whoami"
May 07 18:23:52 vultr k3s[4635]: I0507 18:23:52.284496    4635 memory_manager.go:354] "RemoveStaleState removing state" podUID="2cd9f56a-f17c-44bf-b886-36a419d919e1" containerName="whoami"
May 07 18:23:52 vultr k3s[4635]: I0507 18:23:52.460776    4635 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-94kdb\" (UniqueName: \"kubernetes.io/projected/14693fc2-b378-4c6c-8764-aee596c32cd2-kube-api-access-94kdb\") pod \"server\" (UID: \"14693fc2-b378-4c6c-8764-aee596c32cd2\") " pod="default/server"

Finally, when I try (and fail) to make an http request to the running pod, sometimes I will see output like this:

May 07 18:24:19 vultr k3s[4635]: E0507 18:24:19.361668    4635 upgradeaware.go:425] Error proxying data from client to backend: readfrom tcp 127.0.0.1:58364->127.0.0.1:10010: write tcp 127.0.0.1:58364->127.0.0.1:10010: write: broken pipe
May 07 18:24:19 vultr k3s[4635]: E0507 18:24:19.364346    4635 upgradeaware.go:439] Error proxying data from backend to client: write tcp 127.0.0.1:10250->127.0.0.1:60376: write: connection reset by peer

However, that last doesn't happen all the time.

In addition, separately from my running any tests, I have seen this message come up:

May 07 18:27:18 vultr k3s[4635]: E0507 18:27:18.837279    4635 kubelet_node_status.go:711] "Failed to set some node status fields" err="failed to validate secondaryNodeIP: node IP: \"2001:19f0:5401:2db0:5400:4ff:fee6:63ac\" not found in the host's network interfaces" node="vultr"

But I think that's just because I used a publicly-routable IPv6 range that doesn't actually route to the server I'm using.

manuelbuil · 2024-05-08T11:02:06Z

Thanks for reporting this, I was able to reproduce it. I suspect this is a bug in the kube-router network policy controller.

When following your steps on a dual-stack environment, the first time I create the server, if I execute ipset list I can see both server IPs: 10.42.1.5 and 2001:cafe:42:1::5

Name: KUBE-DST-2FAIIK2E4RIPMTGF
Type: hash:ip
Revision: 5
Header: family inet hashsize 1024 maxelem 65536 timeout 0
Size in memory: 280
References: 2
Number of entries: 1
Members:
10.42.1.5 timeout 0

Name: inet6:KUBE-DST-I3PRO5XXEERITJZO
Type: hash:ip
Revision: 5
Header: family inet6 hashsize 1024 maxelem 65536 timeout 0
Size in memory: 304
References: 2
Number of entries: 1
Members:
2001:cafe:42:1::5 timeout 0

After removing the server and recreating again, I can only see the ipv6 address: 2001:cafe:42:1::6:

Name: KUBE-DST-2FAIIK2E4RIPMTGF
Type: hash:ip
Revision: 5
Header: family inet hashsize 1024 maxelem 65536 timeout 0
Size in memory: 216
References: 2
Number of entries: 0
Members:

Name: inet6:KUBE-DST-I3PRO5XXEERITJZO
Type: hash:ip
Revision: 5
Header: family inet6 hashsize 1024 maxelem 65536 timeout 0
Size in memory: 304
References: 2
Number of entries: 1
Members:
2001:cafe:42:1::6 timeout 0

The ipv4 memeber list has no IPs available. As a consequence, the verdict will always be false when the iptables rule checks that ipset.

Workaround: Change your ipFamilyPolicy: service config to PreferDualStack. The ipv6 config is correct, so the wget command will work

jfmontanaro · 2024-05-08T16:09:44Z

Hi @manuelbuil, thanks for the workaround! Can confirm it works in my testing environment.

manuelbuil · 2024-05-09T14:27:30Z

It's definitely a bug in upstream kube-router. The problem is that each ipFamily (ipv4 & ipv6) carries an ipset handler object. However, that ipset handler is tracking ipsets of both ipFamilies but only updates its ipFamily ipset correctly, the other one is left with outdated data (especially ipv6). As the ipv6 ipset is always the second one to be "refreshed", it overwrites the ipv4 ipset with that outdated data. I'm going to collect more information and open a bug issue in kube-router upstream

manuelbuil · 2024-05-09T16:12:59Z

cloudnativelabs/kube-router#1665

manuelbuil · 2024-05-10T06:06:51Z

kube-router maintainer created a fix. I tested and it fixes the issue :)

jfmontanaro · 2024-05-10T17:41:17Z

Wow, that's great! Thanks for getting this addressed so quickly!

rancher-max · 2024-05-29T23:46:33Z

Validated using commitid f2e7c01acfdc5f51bfd007c44bfe6605e8864975 from the master branch via the linked automated test above and added to our regression suite.

Thank you so much for this report and very clear reproduction steps! The PR linked has the steps and details of the test results for those curious. This fix will be available for general consumption in the June patch releases.

jfmontanaro · 2024-05-30T18:37:47Z

Great, thanks so much for the speedy response!

brandond assigned manuelbuil and rbrtbnfgl May 6, 2024

manuelbuil added the kind/bug Something isn't working label May 8, 2024

brandond added the kind/upstream-issue This issue appears to be caused by an upstream bug label May 9, 2024

rbrtbnfgl added this to the v1.29.6+k3s1 milestone May 15, 2024

rbrtbnfgl mentioned this issue May 15, 2024

Update kube-router to v2.1.2 #10102

Closed

rbrtbnfgl mentioned this issue May 24, 2024

Update kube-router version to v2.1.2 #10177

Merged

rbrtbnfgl modified the milestones: v1.29.6+k3s1, v1.30.2+k3s1 May 24, 2024

This was referenced May 24, 2024

[Release 1.29] Update Kube-router to v2.1.2 #10178

Closed

[Release 1.28] Update Kube-router to v2.1.2 #10179

Closed

[Release 1.27] Update Kube-router to v2.1.2 #10180

Closed

rancher-max self-assigned this May 24, 2024

rancher-max mentioned this issue May 29, 2024

Add dualstack test to validate k3s issue 10053 rancher/distros-test-framework#106

Merged

rancher-max closed this as completed May 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Network policy in dual-stack cluster stops allowing ingress after pod restart #10053

Network policy in dual-stack cluster stops allowing ingress after pod restart #10053

jfmontanaro commented May 1, 2024

brandond commented May 1, 2024 •

edited

Loading

jfmontanaro commented May 2, 2024

brandond commented May 2, 2024

jfmontanaro commented May 3, 2024

brandond commented May 6, 2024

brandond commented May 6, 2024

ghost commented May 7, 2024 •

edited by ghost

Loading

jfmontanaro commented May 7, 2024

manuelbuil commented May 8, 2024

jfmontanaro commented May 8, 2024

manuelbuil commented May 9, 2024

manuelbuil commented May 9, 2024

manuelbuil commented May 10, 2024

jfmontanaro commented May 10, 2024

rancher-max commented May 29, 2024

jfmontanaro commented May 30, 2024

Network policy in dual-stack cluster stops allowing ingress after pod restart #10053

Network policy in dual-stack cluster stops allowing ingress after pod restart #10053

Comments

jfmontanaro commented May 1, 2024

brandond commented May 1, 2024 • edited Loading

jfmontanaro commented May 2, 2024

brandond commented May 2, 2024

jfmontanaro commented May 3, 2024

brandond commented May 6, 2024

brandond commented May 6, 2024

ghost commented May 7, 2024 • edited by ghost Loading

jfmontanaro commented May 7, 2024

manuelbuil commented May 8, 2024

jfmontanaro commented May 8, 2024

manuelbuil commented May 9, 2024

manuelbuil commented May 9, 2024

manuelbuil commented May 10, 2024

jfmontanaro commented May 10, 2024

rancher-max commented May 29, 2024

jfmontanaro commented May 30, 2024

brandond commented May 1, 2024 •

edited

Loading

ghost commented May 7, 2024 •

edited by ghost

Loading