ingresses stop working every now and then #10054

myyc · 2024-05-01T07:47:00Z

Environmental Info:
K3s Version:

k3s version v1.29.3+k3s1 (8aecc26)
go version go1.21.8

Node(s) CPU architecture, OS, and Version:

Linux n1 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux
Linux n2 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux

Cluster Configuration:

2 nodes. 1 server, 1 agent

Describe the bug:
some ingresses stop working after a while, it seems like there are networking issues between the nodes. i have no firewall configured between them and they can otherwise talk to each other. restarting coredns and the agent on node n2 (always that one) fixes things temporarily.

Steps To Reproduce:

Install K3s on the first node (the default way)
Install K3s on the second node (using the token, etc.)
Configure a few services
Wait a few hours

This sounds more like an issue with my configuration than a bug. Any clue how to debug this? Should i just wipe the configuration in n2 and reinstall?

The text was updated successfully, but these errors were encountered:

dereknola · 2024-05-01T22:36:01Z

You would need to provide k3s configuration and logs during the event for us to help you. There isn't enough for us to act on.

myyc · 2024-05-02T07:45:10Z

which k3s configuration exactly, as in which files?

i checked k3s-agent's logs and there isn't anything meaningful, e.g. yesterday logs stopped at midnight, when everything worked fine, but as soon as i restarted k3s-agent, this appeared:

May 02 08:41:20 n2 systemd[1]: k3s-agent.service: Found left-over process 2497 (containerd-shim) in control group while starting unit. Ignoring.
May 02 08:41:20 n2 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 02 08:41:20 n2 systemd[1]: k3s-agent.service: Found left-over process 2920 (containerd-shim) in control group while starting unit. Ignoring.
May 02 08:41:20 n2 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 02 08:41:20 n2 systemd[1]: k3s-agent.service: Found left-over process 3191 (containerd-shim) in control group while starting unit. Ignoring.
May 02 08:41:20 n2 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 02 08:41:20 n2 systemd[1]: k3s-agent.service: Found left-over process 3525 (containerd-shim) in control group while starting unit. Ignoring.
May 02 08:41:20 n2 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 02 08:41:20 n2 systemd[1]: k3s-agent.service: Found left-over process 4335 (containerd-shim) in control group while starting unit. Ignoring.
May 02 08:41:20 n2 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 02 08:41:20 n2 systemd[1]: k3s-agent.service: Found left-over process 79756 (containerd-shim) in control group while starting unit. Ignoring.
May 02 08:41:20 n2 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.

any clue? any more logs i should inspect?

brandond · 2024-05-06T22:03:42Z

Please attach the complete logs from the time period in question. Those messages are all from systemd, not k3s. They are normal to see, as container processes remain running while k3s itself is stopped.

myyc · 2024-05-08T11:02:59Z

can you tell me which logs specifically? as per the FAQ:

i'm running from systemd (so rc and command line are irrelevant)
pod logs aren't useful as the pods have no issues (i can port-forward)
containerd logs match systemd log

i'm sort of blind right now, i am trying to connect to a specific ingress, it says 404 page not found and i can't really see any info in the logs i'm checking. the only (non-realtime) message i see is in the traefik pod logs, e.g.

time="2024-05-08T10:58:41Z" level=error msg="Skipping service: no endpoints found" serviceName=blahblah namespace=stuff servicePort="&ServiceBackendPort{Name:,Number:8000,}" providerName=kubernetes ingress=blahblah

brandond · 2024-05-08T17:43:07Z

OK, well "can't connect to" is not really the same as "get a 404 response from". In this case you have specific logs from traefik indicating that there are no endpoints for that service, so you'd want to check on the pods backing that service and see why they're not ready.

myyc · 2024-05-08T20:36:43Z

i mentioned before that pods and services are fine, i can port forward and access the service without issues. the issue isn't always the same. earlier on it was 404, now it's gateway timeout. i just restarted k3s-agent again and it's all fine.

i'll ask again. what is the correct way to debug this?

brandond · 2024-05-08T21:08:59Z

Pretty much just standard linux/kubernetes stuff...

Journald logs - k3s on the servers, k3s-agent on agents
Pod events - kubectl describe pod -n <NAMESPACE> <POD>, check for events, restarts, failed health checks, and so on
Check service endpoints - kubectl describe service -n <NAMESPACE> <SERVICE>

Note that you will probably need to catch this very close in time to when you're unable to reach the site via the ingress.

For some reason the service's endpoints are going away at times. I get that you can port-forward to it and such, but you need to figure out why the endpoints are occasionally being removed from the service. This usually indicates that the pods are failing health-checks or are being restarted or recreated for some other reason.

myyc · 2024-05-09T10:55:53Z

they're not really "occasionally" removed. they always are. but it only applies to those that are on that node. once that happens they will stay that way until i restart k3s-agent on said node. anyway, thanks for the help. i'll investigate.

github-actions · 2024-06-23T20:05:26Z

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

myyc · 2024-06-25T08:55:14Z

for posterity, i made a few changes, the most important being moving a few volumes out of the big nfs share and fixing the dns servers on the two nodes instead of getting whichever the network was providing. both of these changes fixed most of the issues as it could be that constantly reading and writing to nfs was causing network congestion.

the "gateway timeout" out of nowhere issue still came out for a couple of deployments and once again it got sorted by restarting k3s on the agent node, but at least this time it's probably obvious that the problem isn't related to k3s.

github-actions bot added the status/stale label Jun 23, 2024

myyc closed this as completed Jun 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ingresses stop working every now and then #10054

ingresses stop working every now and then #10054

myyc commented May 1, 2024

dereknola commented May 1, 2024

myyc commented May 2, 2024

brandond commented May 6, 2024

myyc commented May 8, 2024 •

edited

Loading

brandond commented May 8, 2024

myyc commented May 8, 2024

brandond commented May 8, 2024 •

edited

Loading

myyc commented May 9, 2024

github-actions bot commented Jun 23, 2024

myyc commented Jun 25, 2024

ingresses stop working every now and then #10054

ingresses stop working every now and then #10054

Comments

myyc commented May 1, 2024

dereknola commented May 1, 2024

myyc commented May 2, 2024

brandond commented May 6, 2024

myyc commented May 8, 2024 • edited Loading

brandond commented May 8, 2024

myyc commented May 8, 2024

brandond commented May 8, 2024 • edited Loading

myyc commented May 9, 2024

github-actions bot commented Jun 23, 2024

myyc commented Jun 25, 2024

myyc commented May 8, 2024 •

edited

Loading

brandond commented May 8, 2024 •

edited

Loading