Gateway API backend PODs intermittent timeouts in hostNetwork mode #32592
It seems like intra-node routing is not working. Can you try the troubleshooting steps to see if those uncover anything?
Hi @squeed, deploying connectivity-check.yaml did not raise any issue; everything is running fine.

Test report: [no-unexpected-packet-drops] [1/78]

All outputs are attached as "cilium-issue-32592.log". Thanks for your help ;-)
I noticed that a lot of tests were skipped in the previous execution because ingress was not deployed. The results have changed: [no-unexpected-packet-drops] no longer errors, but other issues are reported.

📋 Test report: detailed output and a sysdump taken just after the test are attached as cilium-issue-32592-with-ingress.log. Thanks
Gut feeling: hostNetwork might not work in tandem with chaining? Letting someone more knowledgeable chime in.
@lmb I disabled Gateway hostNetwork mode and the issue is still there. I also tried different versions of Cilium, as well as AL2023 (latest), without more success. Here are some Envoy logs with and without the failure.

Pod and Envoy on different nodes:

[2024-05-22 11:53:13.520][22][debug][router] [external/envoy/source/common/router/router.cc:478] [Tags: "ConnectionId":"381","StreamId":"5462284546592542596"] cluster 'kube-system/cilium-gateway-gitlab-iac-shared-https/servicemesh-mtls:servicemesh-mtls-pod-1-svc:8080' match for URL '/'

########### 5 SECONDS (client receives a 503 from Envoy at the end) ###########

[2024-05-22 11:53:18.517][22][trace][http] [external/envoy/source/common/http/filter_manager.cc:1169] [Tags: "ConnectionId":"381","StreamId":"5462284546592542596"] encode headers called: filter=cilium.l7policy status=0

Pod and Envoy on the same node:

[2024-05-22 11:56:32.545][22][debug][router] [external/envoy/source/common/router/router.cc:478] [Tags: "ConnectionId":"397","StreamId":"1180725351791552725"] cluster 'kube-system/cilium-gateway-gitlab-iac-shared-https/servicemesh-mtls:servicemesh-mtls-pod-1-svc:8080' match for URL '/'

Attaching a few debug logs from Hubble and Envoy focused on the connection_id and stream_id (success and failure), plus our Cilium ConfigMap. Thanks for your help!
tests_from_localhost.txt
I tried the following as a workaround, because it seems to be an inter-node issue, and it worked!

Node-to-node encryption: "encryption.enabled" = "true"

With this, I can access the backend from any node. 100% success.
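For anyone else trying this workaround, it amounts to turning on Cilium's transparent encryption in the Helm values. A minimal sketch follows; only `encryption.enabled = true` is confirmed by the comment above, and the commented-out `type` key is an assumption about how one might pick the encryption backend in the chart:

```yaml
# Workaround sketch: enable transparent encryption between nodes.
# Only encryption.enabled=true comes from the report above.
encryption:
  enabled: true
  # type: wireguard   # assumption: explicitly selecting a backend, not confirmed by the report
```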
I have a very similar setup, except with kube-proxy, and I also see this issue. Toggling encryption does not work for me.
Is there an existing issue for this?
What happened?
AWS ALB health checks are green against all EKS worker nodes, i.e. the gateway listener ports are reachable from outside the cluster on every worker node.
When trying to reach the webserver through the ALB/target group, the HTTP call may time out after 5 seconds.
After some investigation: 100% of calls succeed if the Envoy/Cilium pod receiving the ALB request and the backend webserver pod are hosted on the same worker node.
If the Envoy/Cilium pod receiving the ALB HTTP request is not located on the same node as the backend webserver, 100% of calls end with the following error:
upstream connect error or disconnect/reset before headers. reset reason: connection timeout
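For context, the topology described above can be reproduced with a minimal Gateway plus HTTPRoute of roughly this shape. All names, namespaces, and the backend Service here are hypothetical placeholders for illustration, not taken from the report:

```yaml
# Sketch of the Gateway API resources involved; every name below is hypothetical.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: example-gateway          # hypothetical
  namespace: default
spec:
  gatewayClassName: cilium
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-route            # hypothetical
  namespace: default
spec:
  parentRefs:
  - name: example-gateway
  rules:
  - backendRefs:
    - name: webserver-svc        # hypothetical backend Service
      port: 8080
```

The failure mode reported above would then depend on whether the backend pods of `webserver-svc` land on the same node as the Cilium/Envoy pod terminating the listener.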
Cilium Version
1.16.0-pre.2
Kernel Version
Tested in AL2 kernel 5.10/5.15 and AL2023 (latest kernel)
Kubernetes Version
AWS EKS 1.27 in VPC-CNI chaining mode without kube-proxy
Regression
hostNetwork mode is a new feature, if I'm correct.
Sysdump
cilium-sysdump-20240516-174847.zip
pod-node.pcap.zip
envoy-node.pcap.zip
Relevant log output
Anything else?
Helm values
Chaining with AWS VPC CNI
cni.chainingMode = aws-cni
cni.exclusive = false
enableIPv4Masquerade = false
routingMode = native
endpointRoutes.enabled = true
kube-proxy-free
kubeProxyReplacement = true
GatewayAPI configuration in hostNetwork mode
gatewayAPI.enabled = true
envoy.enabled = false # SE Linux enabled
gatewayAPI.hostNetwork.enabled = true
Avoid warning on non-routable interface
devices[0] = eth0
Proxy enabled
l7Proxy = true
loadBalancer.l7.backend = envoy
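Expressed as a values.yaml fragment, the settings above would look roughly like this. This is a sketch derived from the dotted keys listed, with the nesting assumed from the key paths themselves:

```yaml
# Consolidated sketch of the Helm values listed above (nesting inferred from the dotted keys).
cni:
  chainingMode: aws-cni
  exclusive: false
enableIPv4Masquerade: false
routingMode: native
endpointRoutes:
  enabled: true
kubeProxyReplacement: true
gatewayAPI:
  enabled: true
  hostNetwork:
    enabled: true
envoy:
  enabled: false        # SELinux enabled
devices:
  - eth0
l7Proxy: true
loadBalancer:
  l7:
    backend: envoy
```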
Tried with and without combinations of the following Helm values, with no improvement:
loadBalancer.l7.backend = envoy
l2announcements.enabled = true
ipv4NativeRoutingCIDR = 10.28.0.0/16
bpf.masquerade = true
bpf.hostLegacyRouting = true
bpf.tproxy = true
externalIPs.enabled = true
localRedirectPolicy = true
enableIPv4Masquerade = true
ipam.operator.clusterPoolIPv4PodCIDRList = 10.28.0.0/16
tunnelProtocol = geneve
loadBalancer.mode = dsr
loadBalancer.dsrDispatch = opt
loadBalancer.acceleration = best-effort
enableMasqueradeRouteSource = true
Cilium Users Document
Code of Conduct