Problems with coredns timeouts and pods DNS resolution with bpf.masquerade enabled #32489
I'm not able to reproduce this. I installed a kind cluster with the same settings. Did you try changing this setting on a running cluster, or was it from scratch?
I create a cluster from scratch each time. In your tests, are you able to resolve anything from some test pod, like the Alpine packages repo?
I tried with your exact setup -- except on Linux -- and it worked perfectly. There must be some kind of strange discrepancy -- maybe Mac is the issue? One strange thing I see is this line in your sysdump:

whereas on my cluster, I see:

Not sure if that's potentially an issue. What happens if you disable encryption?
I noticed the encryption status come and go as I make changes to the values file and re-apply them. I suspect the issue is a Mac thing as well, just not sure how to debug it.
Yeah, at the end of the day, Docker on Mac is not really a supported platform; it's useful for development -- and many Cilium developers use it! But I'm not sure that we have the expertise to dig into these sorts of issues.
So last night I made some progress. Let's say my setup has:

What would be the correct value in that case? Perhaps that's causing the issue on my end. Most importantly, are the CIDRs compatible with each other?
So I ran into this article where, apparently, the coredns configmap needs to have a fixed nameserver instead of relying on /etc/resolv.conf. After I tried this, there were no more coredns timeout errors and traffic flowed as expected. I used this config:
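Roughly along these lines -- a minimal sketch of the coredns ConfigMap with a fixed upstream in place of `forward . /etc/resolv.conf` (the 1.1.1.1/1.0.0.1 forwarders here are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        # fixed public resolvers instead of "forward . /etc/resolv.conf"
        forward . 1.1.1.1 1.0.0.1
        cache 30
        loop
        reload
        loadbalance
    }
```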
Other than this, I'd really appreciate knowing if there are any conflicts or misconfigurations in the CIDRs I used in the chart values that I'm not aware of.
I have seen this problem and confirmed it. The problem is that Docker adds the following rules to iptables/nftables:

```
$ sudo iptables -t nat -S DOCKER_OUTPUT
-N DOCKER_OUTPUT
-A DOCKER_OUTPUT -d 172.18.0.1/32 -p tcp -m tcp --dport 53 -j DNAT --to-destination 127.0.0.11:40721
-A DOCKER_OUTPUT -d 172.18.0.1/32 -p udp -m udp --dport 53 -j DNAT --to-destination 127.0.0.11:38796
```

But when bpfMasquerade is happening, these rules are never hit:

```
$ sudo nft list table ip nat
...
chain DOCKER_OUTPUT {
    ip daddr 172.18.0.1 tcp dport 53 counter packets 0 bytes 0 dnat to 127.0.0.11:40721
    ip daddr 172.18.0.1 udp dport 53 counter packets 128 bytes 9338 dnat to 127.0.0.11:38796
}
```

Those 128 packets were generated by me, testing from the kind node. This problem is specific to Docker and its use of netfilter, which bpf is bypassing.
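One way to confirm which masquerading path is in use is the agent's status output (a quick check; `ds/cilium` assumes the default DaemonSet name):

```sh
# "Masquerading: BPF" means masqueraded traffic bypasses netfilter entirely
# (the in-pod CLI is named cilium-dbg on newer releases)
kubectl -n kube-system exec ds/cilium -- cilium status | grep -i masquerading
```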
I've been able to get by with using a public resolver like 1.1.1.1 in the coredns configmap instead of forwarding everything to /etc/resolv.conf.
Could you check whether this still happens with bpf.masquerade enabled?
You can do that, or you can change resolv.conf on the node and restart the coredns pods. Either one works around the problem, but this should be better handled, probably by fixing the Docker DNS configuration (if possible) through the kind config.
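For kind, the blunt version of that node-side workaround looks something like this (a sketch; the node name `dev-control-plane` is taken from this cluster, and 1.1.1.1 is an arbitrary public resolver):

```sh
# replace Docker's embedded resolver (127.0.0.11) with a real upstream on the node,
# then restart CoreDNS so it re-reads the node's /etc/resolv.conf
docker exec dev-control-plane sh -c 'echo "nameserver 1.1.1.1" > /etc/resolv.conf'
kubectl -n kube-system rollout restart deployment/coredns
```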
It was observed when it was true. It's true right now, and:

```
root@dnsutils:/# dig @172.18.0.1 ipquail.com
;; communications error to 172.18.0.1#53: timed out
;; communications error to 172.18.0.1#53: timed out
```
@julianwiedmann look here, the CI tests for Cilium note this problem and hack around it: https://github.com/cilium/cilium/blob/main/contrib/scripts/kind.sh#L250-L255

This was originally documented in #23283 and was marked resolved by #30321, but that only fixes it for people using the Cilium CI scripts, as noted in #31118, so really it's not solved at all.
This issue has been automatically marked as stale because it has not had recent activity.
I have the same problem, but with Talos and ARM servers. |
Good to see! That fixed the issue for my context (Talos on ARM hcloud VMs) with these Cilium values:

values.yaml
```yaml
prometheus: &prome
  enabled: false
  serviceMonitor:
    trustCRDsExist: true
    enabled: true
k8sServiceHost: 127.0.0.1
k8sServicePort: 7445
ipam:
  mode: kubernetes
routingMode: native
ipv4NativeRoutingCIDR: 10.0.0.0/16
loadBalancer:
  mode: dsr
bpf:
  hostLegacyRouting: true
  masquerade: true
envoy:
  enabled: true
  prometheus: *prome
encryption:
  enabled: true
  type: wireguard
  nodeEncryption: true
kubeProxyReplacement: true
localRedirectPolicy: true
operator:
  prometheus: *prome
  replicas: 2
hubble:
  relay:
    enabled: true
    prometheus: *prome
  ui:
    enabled: true
    rollOutPods: true
    podLabels:
      traefik.home.arpa/ingress: allow
  metrics:
    enableOpenMetrics: true
    enabled:
      - dns:query
      - drop
      - tcp
      - flow
      - port-distribution
      - icmp
      - http
resources: # for agent
  limits:
    memory: 1Gi
### required for Talos ###
securityContext:
  capabilities:
    ciliumAgent:
      - CHOWN
      - KILL
      - NET_ADMIN
      - NET_RAW
      - IPC_LOCK
      - SYS_ADMIN
      - SYS_RESOURCE
      - DAC_OVERRIDE
      - FOWNER
      - SETGID
      - SETUID
    cleanCiliumState: [NET_ADMIN, SYS_ADMIN, SYS_RESOURCE]
cgroup:
  autoMount:
    enabled: false
  hostRoot: /sys/fs/cgroup
logOptions:
  format: json
```

With a local-cache-dns LRP set up.
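For the LRP part, a sketch following the upstream node-local DNS example (the node-local-dns names and labels are assumptions, not taken from this thread):

```yaml
apiVersion: cilium.io/v2
kind: CiliumLocalRedirectPolicy
metadata:
  name: nodelocaldns
  namespace: kube-system
spec:
  redirectFrontend:
    serviceMatcher:
      serviceName: kube-dns
      namespace: kube-system
  redirectBackend:
    # redirect DNS traffic to the node-local cache pod on the same node
    localEndpointSelector:
      matchLabels:
        k8s-app: node-local-dns
    toPorts:
      - port: "53"
        name: dns
        protocol: UDP
      - port: "53"
        name: dns-tcp
        protocol: TCP
```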
Is there an existing issue for this?
What happened?
After enabling `bpf.masquerade=true`, coredns starts timing out and other pods can't resolve anything.

Cilium Version
Client: 1.15.4 9b3f9a8 2024-04-11T17:25:42-04:00 go version go1.21.9 linux/arm64
Daemon: 1.15.4 9b3f9a8 2024-04-11T17:25:42-04:00 go version go1.21.9 linux/arm64
Kernel Version
Linux dev-control-plane 6.6.26-linuxkit #1 SMP Sat Apr 27 04:13:19 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
Kubernetes Version
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.2
Regression
No response
Sysdump
cilium-sysdump-20240512-205923.zip
Relevant log output
I install Cilium with this:
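A minimal helm sketch of such an install, with the flag under discussion (chart version matching the reported 1.15.4; the other values are assumptions):

```sh
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium --version 1.15.4 \
  --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set bpf.masquerade=true
```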
Anything else?
Everything works fine until bpf.masquerade is enabled. That feature alone is the issue, as I tried a number of different configurations. My environment is the latest kind cluster running on Docker for Mac.