
Gateway API backend PODs intermittent timeouts in hostNetwork mode #32592

Open
2 of 3 tasks
bhm-kyndryl opened this issue May 16, 2024 · 7 comments
Labels
area/proxy · feature/k8s-gateway-api · info-completed · kind/bug · kind/community-report · needs/triage · sig/agent

Comments

@bhm-kyndryl

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

AWS ALB health checks are green against all EKS worker nodes, i.e. the gateway listener ports are reachable from outside the cluster on every worker node.

When trying to reach the webserver through the ALB/target group, the HTTP call may time out after 5 seconds.
After some investigation: 100% of calls succeed if the Envoy/Cilium pod receiving the ALB request and the backend webserver pod are hosted on the same worker node.

If the Envoy/Cilium pod receiving the ALB HTTP request is not located on the same node as the backend webserver, 100% of calls end with the following error:

upstream connect error or disconnect/reset before headers. reset reason: connection timeout
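For reference, a minimal reproduction from outside the cluster looks roughly like this (a sketch; the ALB hostname is a placeholder, not taken from this report):

# Hypothetical reproduction; the ALB hostname is a placeholder.
# With Envoy and the backend on different nodes, the request hangs for
# about 5 seconds and then returns a 503 with the error body quoted above.
curl -vk --max-time 10 https://my-alb-1234567890.eu-west-1.elb.amazonaws.com/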

Cilium Version

1.16.0-pre.2

Kernel Version

Tested on AL2 kernels 5.10/5.15 and on AL2023 (latest kernel)

Kubernetes Version

AWS EKS 1.27 in VPC-CNI chaining mode without kube-proxy

Regression

hostNetwork mode is a new feature, if I'm not mistaken

Sysdump

cilium-sysdump-20240516-174847.zip
pod-node.pcap.zip
envoy-node.pcap.zip

Relevant log output

The PCAPs contain TCP flows captured while running some failing tests.

IPs used in the pcap dump files and the Cilium sysdump:

Envoy Node IP: 10.28.16.96/22
Backend node IP: 10.28.23.136/22
POD IP: 10.28.21.15
AWS ALB IPs: 10.29.137.160/27, 10.29.137.224/27, 10.29.137.192/27
GatewayAPI Listener port (SSL Termination mode): 32700
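For context, captures like the attached ones can be produced with something along these lines (a sketch; the interface name eth0 matches the Helm devices setting below, the output file names are assumptions, and the filters reuse the IPs/port above):

# Sketch of the capture commands; eth0 and the output names are assumptions.
# On the Envoy node (10.28.16.96): ALB -> gateway listener plus pod traffic.
sudo tcpdump -i eth0 -w envoy-node.pcap 'port 32700 or host 10.28.21.15'
# On the backend node (10.28.23.136): traffic to/from the backend pod.
sudo tcpdump -i eth0 -w pod-node.pcap 'host 10.28.21.15'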

Anything else?

Helm values

Chaining with AWS VPC CNI

cni.chainingMode = aws-cni
cni.exclusive = false
enableIPv4Masquerade = false
routingMode = native
endpointRoutes.enabled = true

kube-proxy-free

kubeProxyReplacement = true

GatewayAPI configuration in hostNetwork mode

gatewayAPI.enabled = true
envoy.enabled = false # SELinux enabled
gatewayAPI.hostNetwork.enabled = true

Avoid warning on non-routable interface

devices[0] = eth0

Proxy enabled

l7Proxy = true
loadBalancer.l7.backend = envoy
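Put together, this baseline configuration corresponds to a Helm invocation roughly like the following (a sketch; the release name, namespace and chart version are assumptions, and the --set flags simply mirror the values listed above):

# Sketch of the baseline install; the --set flags mirror the values above.
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --version 1.16.0-pre.2 \
  --set cni.chainingMode=aws-cni \
  --set cni.exclusive=false \
  --set enableIPv4Masquerade=false \
  --set routingMode=native \
  --set endpointRoutes.enabled=true \
  --set kubeProxyReplacement=true \
  --set gatewayAPI.enabled=true \
  --set envoy.enabled=false \
  --set gatewayAPI.hostNetwork.enabled=true \
  --set "devices[0]=eth0" \
  --set l7Proxy=true \
  --set loadBalancer.l7.backend=envoy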

I also tried with/without combinations of the following Helm values, with no improvement:

loadBalancer.l7.backend = envoy
l2announcements.enabled = true
ipv4NativeRoutingCIDR = 10.28.0.0/16
bpf.masquerade = true
bpf.hostLegacyRouting = true
bpf.tproxy = true
externalIPs.enabled = true
localRedirectPolicy = true
enableIPv4Masquerade = true
ipam.operator.clusterPoolIPv4PodCIDRList = 10.28.0.0/16
tunnelProtocol = geneve
loadBalancer.mode = dsr
loadBalancer.dsrDispatch = opt
loadBalancer.acceleration = best-effort
enableMasqueradeRouteSource = true

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
@bhm-kyndryl bhm-kyndryl added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels May 16, 2024
@squeed
Contributor

squeed commented May 17, 2024

It seems like inter-node routing is not working. Can you try the troubleshooting steps to see if those uncover anything?
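(For reference, the first-pass checks with the cilium-cli boil down to roughly the following; a sketch, assuming Cilium runs in kube-system:)

# Sketch of first-pass troubleshooting with the cilium-cli.
cilium status --wait        # agent/operator health overview
cilium connectivity test    # cross-node connectivity test suite
cilium sysdump              # collect state for further debugging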

@squeed squeed added the need-more-info More information is required to further debug or fix the issue. label May 17, 2024
@bhm-kyndryl
Author

bhm-kyndryl commented May 20, 2024

Hi @squeed

Deploying connectivity-check.yaml did not surface any issues; all pods are running fine.
But "cilium connectivity test" from the CLI reported some issues. Here is a summary:

Test Report
❌ 1/46 tests failed (3/498 actions), 32 tests skipped, 0 scenarios skipped:
Test [no-unexpected-packet-drops]:
❌ no-unexpected-packet-drops/no-unexpected-packet-drops/eks-xxxxxx-noprd/ip-10-28-16-96.eu-west-1.compute.internal
❌ no-unexpected-packet-drops/no-unexpected-packet-drops/eks-xxxxxx-noprd/ip-10-28-23-136.eu-west-1.compute.internal
❌ no-unexpected-packet-drops/no-unexpected-packet-drops/eks-xxxxxx-noprd/ip-10-28-25-60.eu-west-1.compute.internal
connectivity test failed: 1 tests failed

Test [no-unexpected-packet-drops] [1/78]
Found unexpected packet drops:
"direction": "EGRESS", "reason": "No mapping for NAT masquerade"
"direction": "INGRESS", "reason": "Invalid packet"

All outputs are attached in "cilium-issue-32592.log".

cilium-issue-32592.log

Thanks for your help ;-)

@github-actions github-actions bot added info-completed The GH issue has received a reply from the author and removed need-more-info More information is required to further debug or fix the issue. labels May 20, 2024
@bhm-kyndryl
Author

I noticed that a lot of tests were skipped in the previous run because no ingress was deployed.
I deployed the ingress controller in hostNetwork mode and re-ran "cilium connectivity test".

The results have changed:

[no-unexpected-packet-drops]: no errors anymore, but other issues are reported

📋 Test Report
❌ 5/51 tests failed (15/513 actions), 27 tests skipped, 0 scenarios skipped:
Failed tests:
[pod-to-ingress-service]
[pod-to-ingress-service-deny-all]
[pod-to-ingress-service-deny-ingress-identity]
[pod-to-ingress-service-deny-backend-service]
[pod-to-ingress-service-allow-ingress-identity]

Attached: the detailed output and a sysdump taken just after the test.

cilium-issue-32592-with-ingress.log
cilium-sysdump-20240520-141807.zip

Thanks

@lmb lmb added area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. sig/agent Cilium agent related. feature/k8s-gateway-api labels May 21, 2024
@lmb
Contributor

lmb commented May 21, 2024

Gut feeling, hostNetwork might not work in tandem with chaining? Letting someone more knowledgeable chime in.

@bhm-kyndryl
Author

bhm-kyndryl commented May 22, 2024

@lmb I disabled Gateway hostNetwork mode and the issue is still here.
@sayboras I also tried a build from main and the issue is still here.

I also tried different Cilium versions and AL2023 (latest), without more success.

Here are some Envoy logs with and without the failure.

Pod and Envoy on different nodes:

[2024-05-22 11:53:13.520][22][debug][router] [external/envoy/source/common/router/router.cc:478] [Tags: "ConnectionId":"381","StreamId":"5462284546592542596"] cluster 'kube-system/cilium-gateway-gitlab-iac-shared-https/servicemesh-mtls:servicemesh-mtls-pod-1-svc:8080' match for URL '/'
[2024-05-22 11:53:13.520][22][debug][router] [external/envoy/source/common/router/router.cc:698] [Tags: "ConnectionId":"381","StreamId":"5462284546592542596"] router decoding headers:
[2024-05-22 11:53:13.520][22][trace][http] [external/envoy/source/common/http/filter_manager.cc:539] [Tags: "ConnectionId":"381","StreamId":"5462284546592542596"] decode headers called: filter=envoy.filters.http.upstream_codec status=4
[2024-05-22 11:53:13.520][22][trace][http] [external/envoy/source/common/http/filter_manager.cc:539] [Tags: "ConnectionId":"381","StreamId":"5462284546592542596"] decode headers called: filter=envoy.filters.http.router status=1
[2024-05-22 11:53:13.520][22][trace][http] [external/envoy/source/common/http/http1/codec_impl.cc:698] [Tags: "ConnectionId":"381"] parsed 131 bytes

########### 5 SECONDS (Client receives 503 from envoy at the end) ###########

[2024-05-22 11:53:18.517][22][trace][http] [external/envoy/source/common/http/filter_manager.cc:1169] [Tags: "ConnectionId":"381","StreamId":"5462284546592542596"] encode headers called: filter=cilium.l7policy status=0
[2024-05-22 11:53:18.517][22][trace][http] [external/envoy/source/common/http/filter_manager.cc:1169] [Tags: "ConnectionId":"381","StreamId":"5462284546592542596"] encode headers called: filter=envoy.filters.http.grpc_stats status=0
[2024-05-22 11:53:18.517][22][trace][http] [external/envoy/source/common/http/filter_manager.cc:1169] [Tags: "ConnectionId":"381","StreamId":"5462284546592542596"] encode headers called: filter=envoy.filters.http.grpc_web status=0
[2024-05-22 11:53:18.517][22][trace][http] [external/envoy/source/common/http/filter_manager.cc:1354] [Tags: "ConnectionId":"381","StreamId":"5462284546592542596"] encode data called: filter=cilium.l7policy status=0
[2024-05-22 11:53:18.517][22][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1925] [Tags: "ConnectionId":"381","StreamId":"5462284546592542596"] Codec completed encoding stream.
[2024-05-22 11:53:18.517][22][trace][connection] [external/envoy/source/common/network/connection_impl.cc:575] [Tags: "ConnectionId":"381"] socket event: 2
[2024-05-22 11:53:18.517][22][debug][router] [external/envoy/source/common/router/router.cc:1290] [Tags: "ConnectionId":"381","StreamId":"5462284546592542596"] upstream reset: reset reason: connection timeout, transport failure reason:
[2024-05-22 11:53:18.517][22][debug][http] [external/envoy/source/common/http/filter_manager.cc:996] [Tags: "ConnectionId":"381","StreamId":"5462284546592542596"] Sending local reply with details upstream_reset_before_response_started{connection_timeout}
[2024-05-22 11:53:18.517][22][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1820] [Tags: "ConnectionId":"381","StreamId":"5462284546592542596"] encoding headers via codec (end_stream=false):
[2024-05-22 11:53:18.517][22][trace][connection] [external/envoy/source/common/network/connection_impl.cc:490] [Tags: "ConnectionId":"381"] writing 134 bytes, end_stream false
[2024-05-22 11:53:18.517][22][trace][http] [external/envoy/source/common/http/filter_manager.cc:1354] [Tags: "ConnectionId":"381","StreamId":"5462284546592542596"] encode data called: filter=envoy.filters.http.grpc_stats status=0
[2024-05-22 11:53:18.517][22][trace][http] [external/envoy/source/common/http/filter_manager.cc:1354] [Tags: "ConnectionId":"381","StreamId":"5462284546592542596"] encode data called: filter=envoy.filters.http.grpc_web status=0
[2024-05-22 11:53:18.517][22][trace][http] [external/envoy/source/common/http/conn_manager_impl.cc:1843] [Tags: "ConnectionId":"381","StreamId":"5462284546592542596"] encoding data via codec (size=91 end_stream=true)

Pod and Envoy on the same node:

[2024-05-22 11:56:32.545][22][debug][router] [external/envoy/source/common/router/router.cc:478] [Tags: "ConnectionId":"397","StreamId":"1180725351791552725"] cluster 'kube-system/cilium-gateway-gitlab-iac-shared-https/servicemesh-mtls:servicemesh-mtls-pod-1-svc:8080' match for URL '/'
[2024-05-22 11:56:32.546][22][trace][http] [external/envoy/source/common/http/filter_manager.cc:539] [Tags: "ConnectionId":"397","StreamId":"1180725351791552725"] decode headers called: filter=envoy.filters.http.upstream_codec status=4
[2024-05-22 11:56:32.546][22][trace][http] [external/envoy/source/common/http/http1/codec_impl.cc:698] [Tags: "ConnectionId":"397"] parsed 131 bytes
[2024-05-22 11:56:32.546][22][trace][router] [external/envoy/source/common/router/upstream_codec_filter.cc:70] [Tags: "ConnectionId":"397","StreamId":"1180725351791552725"] proxying headers
[2024-05-22 11:56:32.546][22][trace][http] [external/envoy/source/common/http/filter_manager.cc:539] [Tags: "ConnectionId":"397","StreamId":"1180725351791552725"] decode headers called: filter=envoy.filters.http.router status=1
[2024-05-22 11:56:32.546][22][debug][router] [external/envoy/source/common/router/upstream_request.cc:580] [Tags: "ConnectionId":"397","StreamId":"1180725351791552725"] pool ready
[2024-05-22 11:56:32.546][22][trace][http] [external/envoy/source/common/http/filter_manager.cc:68] [Tags: "ConnectionId":"397","StreamId":"1180725351791552725"] continuing filter chain: filter=0x24873edccdc0
[2024-05-22 11:56:32.547][22][trace][router] [external/envoy/source/common/router/upstream_request.cc:264] [Tags: "ConnectionId":"397","StreamId":"1180725351791552725"] upstream response headers:
[2024-05-22 11:56:32.547][22][trace][http] [external/envoy/source/common/http/filter_manager.cc:1169] [Tags: "ConnectionId":"397","StreamId":"1180725351791552725"] encode headers called: filter=envoy.filters.http.grpc_stats status=0
[2024-05-22 11:56:32.547][22][trace][http] [external/envoy/source/common/http/filter_manager.cc:1169] [Tags: "ConnectionId":"397","StreamId":"1180725351791552725"] encode headers called: filter=envoy.filters.http.grpc_web status=0
[2024-05-22 11:56:32.547][22][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1820] [Tags: "ConnectionId":"397","StreamId":"1180725351791552725"] encoding headers via codec (end_stream=false):
[2024-05-22 11:56:32.547][22][trace][http] [external/envoy/source/common/http/filter_manager.cc:1354] [Tags: "ConnectionId":"397","StreamId":"1180725351791552725"] encode data called: filter=envoy.filters.http.grpc_web status=0
[2024-05-22 11:56:32.547][22][trace][connection] [external/envoy/source/common/network/connection_impl.cc:490] [Tags: "ConnectionId":"397"] writing 13 bytes, end_stream false
[2024-05-22 11:56:32.547][22][trace][http] [external/envoy/source/common/http/filter_manager.cc:1354] [Tags: "ConnectionId":"397","StreamId":"1180725351791552725"] encode data called: filter=cilium.l7policy status=0
[2024-05-22 11:56:32.547][22][trace][http] [external/envoy/source/common/http/filter_manager.cc:1354] [Tags: "ConnectionId":"397","StreamId":"1180725351791552725"] encode data called: filter=envoy.filters.http.grpc_stats status=0
[2024-05-22 11:56:32.547][22][trace][http] [external/envoy/source/common/http/filter_manager.cc:1354] [Tags: "ConnectionId":"397","StreamId":"1180725351791552725"] encode data called: filter=envoy.filters.http.grpc_web status=0
[2024-05-22 11:56:32.547][22][trace][http] [external/envoy/source/common/http/conn_manager_impl.cc:1843] [Tags: "ConnectionId":"397","StreamId":"1180725351791552725"] encoding data via codec (size=0 end_stream=true)

Attaching a few debug logs from Hubble and Envoy focused on the connection_id and stream_id (success and failure), plus our Cilium ConfigMap (see the Hubble filter sketch after the attachment list).

Thanks for your help !

tests_from_localhost.txt
Pod-Envoy-not-same-node.csv.txt
Pod-Envoy-same-node.csv.txt
cilium_configmap.yaml.txt
hubble.txt
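The Hubble flows in hubble.txt can be narrowed down with a filter roughly like this (a sketch; it reuses the backend pod IP and port from the report and assumes the Hubble CLI is connected to the relay):

# Sketch: follow flows towards the backend pod (IP/port from the report above).
hubble observe --to-ip 10.28.21.15 --port 8080 --follow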

@bhm-kyndryl
Author

I tried the following as a workaround, because it seems to be an inter-node issue, and it worked!

Node to node encryption

"encryption.enabled" = "true"
"encryption.type" = "wireguard"

With this, I can access the backend from any node. 100% success.
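For completeness, the workaround corresponds to roughly the following (a sketch; the release name and namespace are assumptions, and the verification command assumes exec access to the agent pods):

# Sketch: enable node-to-node WireGuard encryption on the existing release.
helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
  --set encryption.enabled=true \
  --set encryption.type=wireguard

# Verify on an agent that WireGuard is active
# (the in-pod binary may be named cilium-dbg on newer images).
kubectl -n kube-system exec ds/cilium -- cilium status | grep -i encryption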

@markandrus

markandrus commented May 31, 2024

I have a very similar setup, except with kube-proxy, and I also see this issue. Toggling encryption does not work for me.
