
ClusterIP addresses for Ingress services no longer work when bpf masquerading is enabled in native routing mode #32525

Open
jspaleta opened this issue May 14, 2024 · 9 comments
Labels
  • area/loadbalancing: Impacts load-balancing and Kubernetes service implementations
  • kind/bug: This is a bug in the Cilium logic.
  • needs/triage: This issue requires triaging to establish severity and next steps.
  • sig/agent: Cilium agent related.

Comments

@jspaleta
Contributor

jspaleta commented May 14, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

I cannot seem to access Ingress clusterIP addresses from a pod on the same node as the service backend pod if bpf masquerading is enabled in native routing mode. I've got a documented Kind cluster environment in a GitHub repo that can be used to reproduce.

In fact I can no longer access either the externalIP or the clusterIP. The loss of the externalIP from inside the cluster isn't necessarily something I would expect to always work, but I do expect the clusterIP to be reachable from inside the cluster.

There seem to be several bpf masquerade issues floating around; I didn't read any of them as being specific to this situation, though they are probably all related.

I was able to isolate the symptoms to just the bpf.masquerade boolean in native routing mode.
I can't seem to trigger this at all in the default tunneling routing mode.

Helm values used:

## baseline values
kubeProxyReplacement: true
routingMode: native
ipv4NativeRoutingCIDR: '10.9.0.0/16'
autoDirectNodeRoutes: true
ingressController:
  # -- Enable cilium ingress controller
  enabled: true
  default: true
  loadbalancerMode: dedicated
gatewayAPI:
  enabled: true
operator:
  replicas: 1
l2announcements:
  enabled: true
ipam:
  mode: 'cluster-pool'
  operator:
    clusterPoolIPv4PodCIDRList:
      - '10.9.0.0/16'
    clusterPoolIPv4MaskSize: 24

## Values under test  
bpf:
  masquerade: true

Cluster operates as expected when baseline helm values are used.
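
For anyone reproducing outside the repo scripts, a minimal sketch of applying these values with Helm; the release name, namespace, chart version, and values file path are assumptions specific to this sketch:

$ helm repo add cilium https://helm.cilium.io/
$ # values.yaml holds the baseline values plus the bpf.masquerade toggle shown above
$ helm upgrade --install cilium cilium/cilium --namespace kube-system --version 1.15.4 -f values.yaml
$ # flipping only the value under test on an existing install
$ helm upgrade cilium cilium/cilium --namespace kube-system --reuse-values --set bpf.masquerade=true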

Cilium Version

cilium v1.15.4

Kernel Version

Linux carbon 6.7.4-200.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Feb 5 22:21:14 UTC 2024 x86_64 GNU/Linux

Kubernetes Version

Using a kind cluster:

kubectl version
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.2

Regression

I did not check for regression

Sysdump

cilium-sysdump-20240513-165256.zip

Relevant log output

No response

Anything else?

I've documented the baseline native routing Kind cluster environment I'm using here:
https://github.com/jspaleta/scale21x-demos/tree/main/environments/cilium-l2lb/imperial-gateway-native-routing

I'll update the issue with additional info on how to use this environment to reproduce the symptoms.

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
@jspaleta jspaleta added kind/bug This is a bug in the Cilium logic. needs/triage This issue requires triaging to establish severity and next steps. kind/community-report This was reported by a user in the Cilium community, eg via Slack. labels May 14, 2024
@jspaleta
Contributor Author

Okay, to reproduce the problem using the imperial-gateway-native-routing environment in my demos repo:

First I edit the Cilium Helm values under the cilium/ directory and enable bpf.masquerade.

Then I provision the kind cluster with variations of the Death Star tutorial service, adding Ingress and Gateway API services fronting the deathstar service.

$ ./install.sh
$ # wait for cilium status to go green
$ ./seed.sh
$ kubectl get services -A
NAMESPACE          NAME                               TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)                      AGE
deathstar          cilium-ingress-deathstar-ingress   LoadBalancer   10.96.120.14    172.18.200.105   80:30658/TCP,443:30755/TCP   29m
deathstar          deathstar                          LoadBalancer   10.96.240.62    172.18.200.101   80:31390/TCP                 29m
...
$ kubectl get ingress -A
NAMESPACE          NAME                CLASS    HOSTS   ADDRESS          PORTS   AGE
deathstar          deathstar-ingress   cilium   *       172.18.200.105   80      31m
...

The tiefighter pod, running on a different node than the deathstar backend pod, works when using the Ingress clusterIP:

$ kubectl exec -n imperial-starships -ti tiefighter -- curl -s -XPOST 10.96.120.14/v1/request-landing
Death Star Landing Request: Ship Landed!

The xwing pod, running on the same node as the deathstar backend pod, doesn't work:

$ kubectl exec -n rebel-scum -ti xwing -- curl -s -XPOST 10.96.120.14/v1/request-landing
upstream connect error or disconnect/reset before headers. reset reason: connection timeout
$ kubectl describe node imperial-gateway-worker2
Non-terminated Pods:          (6 in total)
  Namespace                   Name                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                  ------------  ----------  ---------------  -------------  ---
  deathstar                   deathstar-f64bfbf4d-8v77g             0 (0%)        0 (0%)      0 (0%)           0 (0%)         34m
  rebel-scum                  xwing                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         34m
...
$ kubectl describe node imperial-gateway-worker2

Non-terminated Pods:          (6 in total)
  Namespace                   Name                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                       ------------  ----------  ---------------  -------------  ---
  imperial-starships          tiefighter                                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         35m
...

Interesting note: both the xwing and tiefighter pods are able to access the deathstar service directly using its ClusterIP. It appears that only the Ingress services (and, I'm assuming, also Gateway services; I still need to test that) fronting the actual deathstar service are impacted.
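
To make the comparison concrete, a sketch of the two curl targets using the ClusterIPs from the service listing above (10.96.240.62 is the deathstar service, 10.96.120.14 is its Ingress); per the observation above, the direct call succeeds while the Ingress ClusterIP call times out when bpf.masquerade is enabled:

$ # direct deathstar service ClusterIP, from the pod on the same node as the backend
$ kubectl exec -n rebel-scum -ti xwing -- curl -s -XPOST 10.96.240.62/v1/request-landing
$ # Ingress ClusterIP fronting the same service, from the same pod
$ kubectl exec -n rebel-scum -ti xwing -- curl -s -XPOST 10.96.120.14/v1/request-landing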

@jspaleta
Contributor Author

jspaleta commented May 14, 2024

Quick check: Gateway has the same issue; I can't access it via the clusterIP from a pod on the same node.

I'm able to see a difference between my tiefighter and xwing pods only because I have a single replica configured for my deathstar backend in this environment.

If I scale the deathstar deployment up so there are backends on both worker nodes, I get stochastic behavior on connection attempts from both the xwing and tiefighter pods, depending on which backend is chosen to service the HTTP request.

So for diagnostic purposes it's easier to keep the target service deployment at 1 backend.
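
A sketch of pinning the backend count for diagnosis; the deployment name and namespace are inferred from the pod listing above:

$ kubectl -n deathstar scale deployment deathstar --replicas=1
$ # confirm which node hosts the single backend
$ kubectl -n deathstar get pods -o wide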

@sayboras
Member

Thanks for your issue. Can you give it a try with bpf.legacyHostRouting enabled?

@sayboras
Member

This could be similar to #31653

@jspaleta
Contributor Author

Enabling bpf.legacyHostRouting did not help the situation.

Trying to curl the clusterIP of the deathstar Ingress service from a pod on the same node as the deathstar backend pod still results in the timeout error.

@jspaleta
Contributor Author

jspaleta commented May 14, 2024

@sayboras this definitely looks similar to the error in the "new" connectivity test.
I'm using the latest cilium-cli release, 0.16.7, and it looks like that test is available.

Running baseline native routing without bpf.masquerade enabled, the pod-to-ingress-service tests pass:

$ cilium connectivity test --test='pod-to-ingress-service'
...
[=] Test [pod-to-ingress-service] [61/78]
......
[=] Test [pod-to-ingress-service-deny-all] [62/78]
W0514 13:40:59.099428   65787 warnings.go:70] unknown field "spec.enableDefaultDeny"
......
[=] Test [pod-to-ingress-service-deny-ingress-identity] [63/78]
W0514 13:41:30.985481   65787 warnings.go:70] unknown field "spec.enableDefaultDeny"
......
[=] Test [pod-to-ingress-service-deny-backend-service] [64/78]
W0514 13:41:44.875364   65787 warnings.go:70] unknown field "spec.enableDefaultDeny"
...

Note: for those following along, disregard the known spurious warning. enableDefaultDeny is added in the 1.16.0 Cilium prereleases as an extension to the network policy spec, and the CLI tool is just being overly verbose about it since I'm running Cilium 1.15.4.

Enabling bpf.masquerade results in errors:

$ cilium connectivity test --test='pod-to-ingress-service' --collect-sysdump-on-failure
...
📋 Test Report
❌ 2/5 tests failed (6/30 actions), 73 tests skipped, 0 scenarios skipped:
Test [pod-to-ingress-service]:
  ❌ pod-to-ingress-service/pod-to-ingress-service/curl-1: cilium-test/client-69748f45d8-7ps7g (10.9.1.17) -> cilium-test/cilium-ingress-same-node (cilium-ingress-same-node.cilium-test:80)
  ❌ pod-to-ingress-service/pod-to-ingress-service/curl-2: cilium-test/client2-ccd7b8bdf-f5fnc (10.9.1.236) -> cilium-test/cilium-ingress-same-node (cilium-ingress-same-node.cilium-test:80)
  ❌ pod-to-ingress-service/pod-to-ingress-service/curl-5: cilium-test/client3-868f7b8f6b-v47k9 (10.9.0.51) -> cilium-test/cilium-ingress-other-node (cilium-ingress-other-node.cilium-test:80)
Test [pod-to-ingress-service-allow-ingress-identity]:
  ❌ pod-to-ingress-service-allow-ingress-identity/pod-to-ingress-service/curl-0: cilium-test/client-69748f45d8-7ps7g (10.9.1.17) -> cilium-test/cilium-ingress-same-node (cilium-ingress-same-node.cilium-test:80)
  ❌ pod-to-ingress-service-allow-ingress-identity/pod-to-ingress-service/curl-2: cilium-test/client2-ccd7b8bdf-f5fnc (10.9.1.236) -> cilium-test/cilium-ingress-same-node (cilium-ingress-same-node.cilium-test:80)
  ❌ pod-to-ingress-service-allow-ingress-identity/pod-to-ingress-service/curl-5: cilium-test/client3-868f7b8f6b-v47k9 (10.9.0.51) -> cilium-test/cilium-ingress-other-node (cilium-ingress-other-node.cilium-test:80)
connectivity test failed: 2 tests failed

See attached sysdump corresponding to first of six failed actions:
cilium-sysdump-20240514-140902.zip

Enabling bpf.masquerade and bpf.legacyHostRouting results in errors:

$ cilium connectivity test --test='pod-to-ingress-service' --collect-sysdump-on-failure
📋 Test Report
❌ 2/5 tests failed (6/30 actions), 73 tests skipped, 0 scenarios skipped:
Test [pod-to-ingress-service]:
  ❌ pod-to-ingress-service/pod-to-ingress-service/curl-0: cilium-test/client-69748f45d8-8pczl (10.9.0.91) -> cilium-test/cilium-ingress-same-node (cilium-ingress-same-node.cilium-test:80)
  ❌ pod-to-ingress-service/pod-to-ingress-service/curl-2: cilium-test/client2-ccd7b8bdf-ndvvt (10.9.0.55) -> cilium-test/cilium-ingress-same-node (cilium-ingress-same-node.cilium-test:80)
  ❌ pod-to-ingress-service/pod-to-ingress-service/curl-5: cilium-test/client3-868f7b8f6b-d8slh (10.9.2.12) -> cilium-test/cilium-ingress-other-node (cilium-ingress-other-node.cilium-test:80)
Test [pod-to-ingress-service-allow-ingress-identity]:
  ❌ pod-to-ingress-service-allow-ingress-identity/pod-to-ingress-service/curl-0: cilium-test/client-69748f45d8-8pczl (10.9.0.91) -> cilium-test/cilium-ingress-same-node (cilium-ingress-same-node.cilium-test:80)
  ❌ pod-to-ingress-service-allow-ingress-identity/pod-to-ingress-service/curl-2: cilium-test/client2-ccd7b8bdf-ndvvt (10.9.0.55) -> cilium-test/cilium-ingress-same-node (cilium-ingress-same-node.cilium-test:80)
  ❌ pod-to-ingress-service-allow-ingress-identity/pod-to-ingress-service/curl-5: cilium-test/client3-868f7b8f6b-d8slh (10.9.2.12) -> cilium-test/cilium-ingress-other-node (cilium-ingress-other-node.cilium-test:80)
connectivity test failed: 2 tests failed

See attached sysdump corresponding to first of six failed actions:
cilium-sysdump-20240514-153103.zip

@squeed squeed added sig/agent Cilium agent related. area/loadbalancing Impacts load-balancing and Kubernetes service implementations and removed needs/triage This issue requires triaging to establish severity and next steps. kind/community-report This was reported by a user in the Cilium community, eg via Slack. labels May 15, 2024
@jspaleta
Contributor Author

Just tested with the 1.15.5 release, out today, and it's still not working for me, even with the legacyHostRouting option enabled.

$ cilium version
cilium-cli: v0.16.7 compiled with go1.22.2 on linux/amd64
cilium image (default): v1.15.4
cilium image (stable): v1.15.5
cilium image (running): 1.15.5

kind config

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: imperial-gateway
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  # port forward 80 on the host to 80 on this node
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
- role: worker
- role: worker
networking:
  disableDefaultCNI: true
  kubeProxyMode: "none"
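For completeness, a minimal sketch of bringing the cluster up from this config; the config file name here is an assumption:

$ kind create cluster --config kind-config.yaml
$ # expect imperial-gateway-control-plane plus two workers
$ kubectl get nodes -o wide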

Passing Cilium config

## baseline
kubeProxyReplacement: true
routingMode: native
ipv4NativeRoutingCIDR: '10.9.0.0/16'
autoDirectNodeRoutes: true
ingressController:
  # -- Enable cilium ingress controller
  enabled: true
  default: true
  loadbalancerMode: dedicated
gatewayAPI:
  enabled: true
operator:
  replicas: 1
l2announcements:
  enabled: true
ipam:
  mode: 'cluster-pool'
  operator:
    clusterPoolIPv4PodCIDRList:
      - '10.9.0.0/16'
    clusterPoolIPv4MaskSize: 24

## Under test  
bpf:
  masquerade: false
  legacyHostRouting: false

Failing config

## baseline
kubeProxyReplacement: true
routingMode: native
ipv4NativeRoutingCIDR: '10.9.0.0/16'
autoDirectNodeRoutes: true
ingressController:
  # -- Enable cilium ingress controller
  enabled: true
  default: true
  loadbalancerMode: dedicated
gatewayAPI:
  enabled: true
operator:
  replicas: 1
l2announcements:
  enabled: true
ipam:
  mode: 'cluster-pool'
  operator:
    clusterPoolIPv4PodCIDRList:
      - '10.9.0.0/16'
    clusterPoolIPv4MaskSize: 24

## Under test  
bpf:
  masquerade: true
  legacyHostRouting: true

See attached sysdump for first action failure using 1.15.5

cilium-sysdump-20240515-101911.zip

@squeed squeed added the needs/triage This issue requires triaging to establish severity and next steps. label May 16, 2024
@sayboras
Member

As discussed in the Cilium Slack, the mentioned changes were merged after 1.15.5, so you might need to check out the main branch.
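
For anyone trying the main branch, a rough sketch of installing the chart straight from the Cilium repo; the CI image repository, tag, and values file path below are placeholders/assumptions and would need to point at an actual build (the operator image would likely need a similar override):

$ git clone https://github.com/cilium/cilium.git && cd cilium
$ helm upgrade --install cilium ./install/kubernetes/cilium \
    --namespace kube-system \
    -f /path/to/values.yaml \
    --set image.repository=quay.io/cilium/cilium-ci \
    --set image.tag=<main-branch-build-tag> \
    --set image.useDigest=false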

@jorhett
Copy link

jorhett commented May 23, 2024

@sayboras I used the images you suggested and didn't see a fix. Can you tell me what images I should be using if these weren't the correct ones? https://cilium.slack.com/archives/C1MATJ5U5/p1715887169176509?thread_ts=1715382762.010149&cid=C1MATJ5U5
