
Problems with coredns timeouts and pods DNS resolution with bpf.masquerade enabled #32489

Open
2 of 3 tasks
pentago opened this issue May 12, 2024 · 15 comments
Labels
help-wanted Please volunteer for this by adding yourself as an assignee! info-completed The GH issue has received a reply from the author kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.

Comments


pentago commented May 12, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

After enabling bpf.masquerade=true, CoreDNS starts timing out and other pods can't resolve anything.

Cilium Version

Client: 1.15.4 9b3f9a8 2024-04-11T17:25:42-04:00 go version go1.21.9 linux/arm64
Daemon: 1.15.4 9b3f9a8 2024-04-11T17:25:42-04:00 go version go1.21.9 linux/arm64

Kernel Version

Linux dev-control-plane 6.6.26-linuxkit #1 SMP Sat Apr 27 04:13:19 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

Kubernetes Version

Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.2

Regression

No response

Sysdump

cilium-sysdump-20240512-205923.zip

Relevant log output

[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:57283->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:38103->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:53718->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:33906->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:34466->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:60107->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:34493->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:41721->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:38282->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 7792796240999121637.6686654646417607282. HINFO: read udp 10.42.0.9:35967->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. A: read udp 10.42.0.9:43732->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. AAAA: read udp 10.42.0.9:45840->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. A: read udp 10.42.0.9:33932->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. AAAA: read udp 10.42.0.9:38568->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. AAAA: read udp 10.42.0.9:38284->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. A: read udp 10.42.0.9:45192->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. AAAA: read udp 10.42.0.9:34840->10.100.0.254:53: i/o timeout
[ERROR] plugin/errors: 2 google.com. A: read udp 10.42.0.9:32915->10.100.0.254:53: i/o timeout


random pod log:
nginx@test-5dd9d7b595-786r7:/$ curl google.com
curl: (6) Could not resolve host: google.com
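
A quick way to narrow this down from a test pod (a sketch, not from the original report; it assumes an image with dig, e.g. a dnsutils pod, and that 10.43.0.10 is the cluster DNS Service IP):

# Hypothetical checks from a pod with dig installed. If the query via a public resolver
# succeeds while the one via cluster DNS times out, the problem is on the CoreDNS/upstream
# path rather than general pod egress.
dig +time=2 +tries=1 google.com @10.43.0.10   # cluster DNS (assumed kube-dns Service IP)
dig +time=2 +tries=1 google.com @1.1.1.1      # bypasses CoreDNS entirely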

I install Cilium with this:

helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --set cluster.name=$CLUSTER_NAME \
  --set kubeProxyReplacement=true \
  --set ipv4.enabled=true \
  --set ipv6.enabled=false \
  --set k8sServiceHost=$CLUSTER_NAME-control-plane \
  --set k8sServicePort=6443 \
  --set ipam.mode=cluster-pool \
  --set ipam.operator.clusterPoolIPv4PodCIDRList="10.42.0.0/16" \
  --set ipam.operator.clusterPoolIPv4MaskSize=24 \
  --set k8s.requireIPv4PodCIDR=true \
  --set autoDirectNodeRoutes=true \
  --set routingMode=native \
  --set endpointRoutes.enabled=true \
  --set ipv4NativeRoutingCIDR="10.0.0.0/8" \
  --set bpf.tproxy=true \
  --set bpf.preallocateMaps=true \
  --set bpf.hostLegacyRouting=false \
  --set bpf.masquerade=true \
  --set enableIPv4Masquerade=true \
  --set encryption.enabled=true \
  --set encryption.type=wireguard \
  --set encryption.nodeEncryption=true \
  --set encryption.strictMode.enabled=true \
  --set encryption.strictMode.cidr="10.0.0.0/8" \
  --set encryption.strictMode.allowRemoteNodeIdentities=true \
  --set rollOutCiliumPods=true \
  --set operator.rollOutPods=true

cilium status output:

root@dev-worker2:/home/cilium# cilium status
KVStore:                 Ok   Disabled
Kubernetes:              Ok   1.29 (v1.29.2) [linux/arm64]
Kubernetes APIs:         ["EndpointSliceOrEndpoint", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "cilium/v2alpha1::CiliumCIDRGroup", "core/v1::Namespace", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement:    True   [eth0    172.18.0.2 fc00:f853:ccd:e793::2 fe80::42:acff:fe12:2 (Direct Routing)]
Host firewall:           Disabled
SRv6:                    Disabled
CNI Chaining:            none
Cilium:                  Ok   1.15.4 (v1.15.4-9b3f9a8c)
NodeMonitor:             Listening for events on 8 CPUs with 64x4096 of shared memory
Cilium health daemon:    Ok
IPAM:                    IPv4: 2/254 allocated from 10.42.2.0/24,
IPv4 BIG TCP:            Disabled
IPv6 BIG TCP:            Disabled
BandwidthManager:        Disabled
Host Routing:            BPF
Masquerading:            BPF   [eth0]   10.0.0.0/8 [IPv4: Enabled, IPv6: Disabled]
Controller Status:       18/18 healthy
Proxy Status:            OK, ip 10.42.2.223, 0 redirects active on ports 10000-20000, Envoy: embedded
Global Identity Range:   min 256, max 65535
Hubble:                  Ok              Current/Max Flows: 137/4095 (3.35%), Flows/s: 1.83   Metrics: Disabled
Encryption:              Wireguard       [NodeEncryption: Enabled, cilium_wg0 (Pubkey: vQfrUsFvKKYFvplB8kScoY0EAl5F6YLRYkYB/DbILnw=, Port: 51871, Peers: 2)]
Cluster health:          3/3 reachable   (2024-05-12T19:12:44Z)
Modules Health:          Stopped(0) Degraded(0) OK(11) Unknown(3)

Anything else?

Everything works fine until bpf.masquerade is enabled.
That setting alone triggers the issue; I tried a number of different configurations.
My environment is the latest kind cluster running on Docker Desktop for Mac.
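
For reference, the environment could be approximated with something like this (a sketch only; node count and subnets are inferred from the details above, not the exact config used):

# Rough kind config matching the setup described above.
cat <<EOF | kind create cluster --name dev --config -
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true    # Cilium will be installed as the CNI
  kubeProxyMode: "none"      # kubeProxyReplacement=true in the Helm values
  podSubnet: "10.42.0.0/16"
  serviceSubnet: "10.43.0.0/16"
nodes:
  - role: control-plane
  - role: worker
  - role: worker
EOF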

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
@pentago pentago added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels May 12, 2024

squeed commented May 15, 2024

I'm not able to reproduce this. I installed a kind cluster with bpf.masquerade enabled and it works as expected.

Did you try changing this setting on a running cluster, or was it from scratch?

@squeed squeed added the need-more-info More information is required to further debug or fix the issue. label May 15, 2024

pentago commented May 15, 2024

I create a cluster from scratch each time.

In your tests, are you able to resolve anything from a test pod, like the Alpine package repo?

@github-actions github-actions bot added info-completed The GH issue has received a reply from the author and removed need-more-info More information is required to further debug or fix the issue. labels May 15, 2024

squeed commented May 15, 2024

I tried with your exact setup -- except on Linux -- and it worked perfectly. There must be some kind of strange discrepancy -- maybe macOS is the issue?

One strange thing I see is this line in cilium-dbg status:

Encryption:                           Wireguard       [NodeEncryption: OptedOut, cilium_wg0 (Pubkey: XXX, Port: 51871, Peers: 2)]

whereas on my cluster, I see

Encryption:              Wireguard       [NodeEncryption: Enabled, cilium_wg0 (Pubkey: XXXX, Port: 51871, Peers: 1)]

Not sure if that's potentially an issue. What happens if you disable encryption?


pentago commented May 15, 2024

I noticed the encryption status come and go as I make changes to the values file and apply them with helm upgrade. By default it's enabled and works fine.

I suspect the issue is a Mac thing as well, I'm just not sure how to debug it.
I guess the setup is much more complex on Macs than on Linux because of Docker Desktop's underlying VM. It would be great to have some documentation dealing with that test case.


squeed commented May 16, 2024

Yeah, at the end of the day, Docker on Mac is not really a supported platform; it's useful for development -- and many Cilium developers use it! But I'm not sure that we have the expertise to dig into these sorts of issues.

@squeed squeed added the help-wanted Please volunteer for this by adding yourself as an assignee! label May 16, 2024

pentago commented May 16, 2024

So last night I made some progress.
Apparently BPF masquerading works, but only if native routing is changed to tunnel mode.

Let's say my setup has:

  • Docker subnet CIDR: 10.100.0.0/24
  • nodes CIDR: 172.18.0.0/24
  • pods CIDR: 10.42.0.0/16
  • services CIDR: 10.43.0.0/16

What would be the correct value for ipv4NativeRoutingCIDR?

Perhaps that's causing the issue on my end.

Most importantly, are bpf.masquerading and routingMode:native supposed to be used together?


pentago commented May 16, 2024

So I ran into this article where, apparently, the CoreDNS ConfigMap needs a fixed nameserver instead of relying on /etc/resolv.conf (not sure why, though).

After I tried this, there were no more CoreDNS timeout errors and traffic flows as expected.
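
A minimal sketch of that change (assuming the stock kind/kubeadm Corefile, which forwards to /etc/resolv.conf; the resolver here is only an example):

# Swap the resolv.conf forwarder in the CoreDNS Corefile for a fixed public resolver,
# then restart CoreDNS so the new upstream takes effect.
kubectl -n kube-system get configmap coredns -o yaml \
  | sed 's|forward . /etc/resolv.conf|forward . 1.1.1.1|' \
  | kubectl replace -f -
kubectl -n kube-system rollout restart deployment coredns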

I used this config:

cluster:
  name: dev

kubeProxyReplacement: true
ipv4:
  enabled: true
ipv6:
  enabled: false

k8sServiceHost: dev-control-plane
k8sServicePort: 6443

ipam:
  mode: cluster-pool
  operator:
    clusterPoolIPv4PodCIDRList: "10.42.0.0/16"  # Pods CIDR
    clusterPoolIPv4MaskSize: 24

k8s:
  requireIPv4PodCIDR: true

autoDirectNodeRoutes: true
routingMode: native
endpointRoutes:
  enabled: true

ipv4NativeRoutingCIDR: "10.42.0.0/16"  # Pods CIDR

bpf:
  tproxy: true
  preallocateMaps: true
  hostLegacyRouting: false
  masquerade: true

ipMasqAgent:
  enabled: true
  config:
    nonMasqueradeCIDRs:
      - 10.42.0.0/16 # Pods CIDR

enableIPv4Masquerade: true

encryption:
  enabled: true
  type: wireguard
  nodeEncryption: true
  strictMode:
    enabled: true
    cidr: "10.42.0.0/16"  # Pods CIDR
    allowRemoteNodeIdentities: true

externalIPs:
  enabled: true

nodePort:
  enabled: true

hostPort:
  enabled: true

hubble:
  enabled: true
  relay:
    enabled: true
    rollOutPods: true
  ui:
    enabled: true
    rollOutPods: true

rollOutCiliumPods: true
operator:
  rollOutPods: true

Other than this, I'd really appreciate a pointer to any conflicts or misconfigurations in the CIDRs used in the chart values that I'm not aware of.


jorhett commented May 30, 2024

I have seen this problem and confirmed it. The problem is that Docker adds the following rules to iptables/nftables:

$ sudo iptables -t nat -S DOCKER_OUTPUT
-N DOCKER_OUTPUT
-A DOCKER_OUTPUT -d 172.18.0.1/32 -p tcp -m tcp --dport 53 -j DNAT --to-destination 127.0.0.11:40721
-A DOCKER_OUTPUT -d 172.18.0.1/32 -p udp -m udp --dport 53 -j DNAT --to-destination 127.0.0.11:38796

But when bpf.masquerade is enabled, these rules are never hit.

$ sudo  nft list table ip nat
...
chain DOCKER_OUTPUT {
		ip daddr 172.18.0.1 tcp dport 53 counter packets 0 bytes 0 dnat to 127.0.0.11:40721
		ip daddr 172.18.0.1 udp dport 53 counter packets 128 bytes 9338 dnat to 127.0.0.11:38796
	}

Those 128 packets were generated by me, testing from the kind node. This problem is specific to Docker and its use of netfilter, which BPF is bypassing.
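
One way to see this from the host side (a sketch; the node name follows the kind naming used above and the addresses are the ones from the rules):

# Inside the kind node, resolv.conf typically points at the Docker network gateway
# (172.18.0.1 here), which DOCKER_OUTPUT is supposed to DNAT to the embedded DNS at
# 127.0.0.11. Watch the counters while a pod resolves a name: with bpf.masquerade
# enabled they stay flat, because the BPF datapath bypasses these netfilter rules.
docker exec dev-control-plane cat /etc/resolv.conf
docker exec dev-control-plane nft list chain ip nat DOCKER_OUTPUT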


pentago commented May 30, 2024

I've been able to get by with using a public resolver like 1.1.1.1 in the CoreDNS ConfigMap instead of forwarding everything to /etc/resolv.conf.

julianwiedmann (Member) commented:

But when bpf.masquerade is enabled, these rules are never hit.

Could you check with bpf.hostLegacyRouting=true?
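
For reference, one way to flip just that setting on the existing release (a sketch, assuming the release name and namespace from the install command earlier in this issue):

# Toggle hostLegacyRouting while keeping all other values, then restart the agents.
helm upgrade cilium cilium/cilium -n kube-system --reuse-values --set bpf.hostLegacyRouting=true
kubectl -n kube-system rollout restart daemonset cilium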

jorhett commented May 30, 2024

I've been able to get by with using a public resolver like 1.1.1.1 in the CoreDNS ConfigMap instead of forwarding everything to /etc/resolv.conf.

You can do that, or you can change resolv.conf on the node and restart the CoreDNS pods (see the sketch below). Either one works around the problem, but this should be handled better, probably by fixing the Docker DNS configuration (if possible) through the kind config.
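
A sketch of that node-side workaround (node names and the resolver are examples; note this replaces Docker's embedded DNS on the nodes entirely):

# Point each kind node's resolv.conf at an external resolver so CoreDNS (which inherits
# the node's resolv.conf) no longer forwards to the Docker-embedded DNS, then restart
# CoreDNS. This works around the bypassed DNAT rules rather than fixing them.
for node in dev-control-plane dev-worker dev-worker2; do
  docker exec "$node" sh -c 'echo "nameserver 1.1.1.1" > /etc/resolv.conf'
done
kubectl -n kube-system rollout restart deployment coredns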

Could you check with bpf.hostLegacyRouting=true

It was observed when it was true. It's true right now, and:

root@dnsutils:/# dig @172.18.0.1 ipquail.com
;; communications error to 172.18.0.1#53: timed out
;; communications error to 172.18.0.1#53: timed out

jorhett commented Jun 4, 2024

@julianwiedmann look here: the CI scripts for Cilium note this problem and hack around it: https://github.com/cilium/cilium/blob/main/contrib/scripts/kind.sh#L250-L255

This was originally documented in #23283 and was marked resolved by #30321, but that only fixes it for people using the Cilium CI scripts, as noted in #31118, so really it's not solved at all.

@ti-mo ti-mo added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. and removed needs/triage This issue requires triaging to establish severity and next steps. labels Jun 20, 2024

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Aug 20, 2024
@mrclrchtr

I have the same problem, but with Talos and ARM servers.

@github-actions github-actions bot removed the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Aug 21, 2024
@samos667

But when bpf.masquerade is enabled, these rules are never hit.

Could you check with bpf.hostLegacyRouting=true?

Good catch! That fixes the issue in my context.

Talos on ARM hcloud VMs, with these Cilium values:

values.yaml
prometheus: &prome
  enabled: false
  serviceMonitor:
    trustCRDsExist: true
    enabled: true
k8sServiceHost: 127.0.0.1
k8sServicePort: 7445

ipam:
  mode: kubernetes

routingMode: native
ipv4NativeRoutingCIDR: 10.0.0.0/16

loadBalancer:
  mode: dsr

bpf:
  hostLegacyRouting: true
  masquerade: true

envoy:
  enabled: true
  prometheus: *prome

encryption:
  enabled: true
  type: wireguard
  nodeEncryption: true

kubeProxyReplacement: true
localRedirectPolicy: true

operator:
  prometheus: *prome
  replicas: 2

hubble:
  relay:
    enabled: true
    prometheus: *prome
  ui:
    enabled: true
    rollOutPods: true
    podLabels:
      traefik.home.arpa/ingress: allow
  metrics:
    enableOpenMetrics: true
    enabled:
      - dns:query
      - drop
      - tcp
      - flow
      - port-distribution
      - icmp
      - http

resources:  # for agent
  limits:
    memory: 1Gi

### required for Talos  ###
securityContext:
  capabilities:
    ciliumAgent:
      - CHOWN
      - KILL
      - NET_ADMIN
      - NET_RAW
      - IPC_LOCK
      - SYS_ADMIN
      - SYS_RESOURCE
      - DAC_OVERRIDE
      - FOWNER
      - SETGID
      - SETUID
    cleanCiliumState: [NET_ADMIN, SYS_ADMIN, SYS_RESOURCE]

cgroup:
  autoMount:
    enabled: false
  hostRoot: /sys/fs/cgroup

logOptions:
  format: json

With a local-cache-dns LRP (Local Redirect Policy) set up.
