netdata state container: runaway FD use #372

Open
withinboredom opened this issue Jul 10, 2023 · 3 comments

@withinboredom

Netdata needed to be removed because it consumed ALL available file descriptors (mildly entertaining that, as far as I could find, this isn't a monitored metric in netdata).

From lsof it appears that it is just opening the WAL/db in a loop:


netdata   1781992 1782297 RRDCONTEX              201   15r      CHR                1,3         0t0          6 /dev/null
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   16u     sock                0,8         0t0   62179404 protocol: UNIX-STREAM
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   17ur     REG              0,121     4587520   14044310 /var/cache/netdata/netdata-meta.db
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   18u      REG              0,121     4906952   14044311 /var/cache/netdata/netdata-meta.db-wal
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   19ur     REG              0,121       32768   14044312 /var/cache/netdata/netdata-meta.db-shm
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   20ur     REG              0,121        4096   14044313 /var/cache/netdata/context-meta.db
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   21u      REG              0,121       37112   14044314 /var/cache/netdata/context-meta.db-wal
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   22ur     REG              0,121       32768   14044315 /var/cache/netdata/context-meta.db-shm
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   23ur     REG              0,121     4587520   14044310 /var/cache/netdata/netdata-meta.db
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   24u      REG              0,121     4906952   14044311 /var/cache/netdata/netdata-meta.db-wal
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   25u  a_inode               0,14           0      12713 [eventpoll]
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   26r     FIFO               0,13         0t0   62177496 pipe
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   27w     FIFO               0,13         0t0   62177496 pipe
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   28u  a_inode               0,14           0      12713 [eventfd]
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   29u  a_inode               0,14           0      12713 [eventpoll]
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   30r     FIFO               0,13         0t0   62171736 pipe
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   31w     FIFO               0,13         0t0   62171736 pipe
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   32u  a_inode               0,14           0      12713 [eventfd]
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   33r     FIFO               0,13         0t0   62176463 pipe
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   34u     sock                0,8         0t0   62161843 protocol: UDPv6
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   35w     FIFO               0,13         0t0   62166944 pipe
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   36u     sock                0,8         0t0   62161844 protocol: UDP
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   37u     sock                0,8         0t0   62161847 protocol: TCPv6
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   38u     sock                0,8         0t0   62161848 protocol: TCP
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   39u     sock                0,8         0t0   62178506 protocol: UNIX-STREAM
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   40u  a_inode               0,14           0      12713 [eventfd]
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   41u  a_inode               0,14           0      12713 [eventpoll]
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   42r     FIFO               0,13         0t0   62178505 pipe
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   43w     FIFO               0,13         0t0   62178505 pipe
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   44r      CHR                1,3         0t0          6 /dev/null
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   45r     FIFO               0,13         0t0   62160878 pipe
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   46w     FIFO               0,13         0t0   62160878 pipe
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   47r      REG              0,273           0   62169795 /proc/1/task/209/stat
lsof: no pwd entry for UID 201
netdata   1781992 1782297 RRDCONTEX              201   48u     sock                0,8         0t0   62194554 protocol: TCP
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201  cwd       DIR              0,121        4096   13901928 /etc/netdata
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201  rtd       DIR              0,121        4096   14034715 /
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201  txt       REG              0,121     6986240   13902420 /usr/sbin/netdata
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201  mem-r     REG              0,121               14044315 /var/cache/netdata/context-meta.db-shm (stat: No such file or directory)
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201  mem-r     REG              0,121               14044312 /var/cache/netdata/netdata-meta.db-shm (stat: No such file or directory)
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201  mem       REG              0,121               13896738 /usr/lib/libzstd.so.1.5.5 (stat: No such file or directory)
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201  mem       REG              0,121               13896436 /usr/lib/libgcc_s.so.1 (stat: No such file or directory)
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201  mem       REG              0,121               13896627 /usr/lib/libstdc++.so.6.0.30 (stat: No such file or directory)
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201  mem       REG              0,121               13896405 /usr/lib/libbson-1.0.so.0.0.0 (stat: No such file or directory)
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201  mem       REG              0,121               13896522 /usr/lib/libmongoc-1.0.so.0.0.0 (stat: No such file or directory)
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201  mem       REG              0,121               13896615 /usr/lib/libsnappy.so.1.1.10 (stat: No such file or directory)
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201  mem       REG              0,121               13896595 /usr/lib/libprotobuf.so.32.0.12 (stat: No such file or directory)
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201  mem       REG              0,121               13896491 /usr/lib/libjson-c.so.5.2.0 (stat: No such file or directory)
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201  mem       REG              0,121               13896512 /usr/lib/liblz4.so.1.9.4 (stat: No such file or directory)
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201  mem       REG              0,121               13896695 /usr/lib/libuv.so.1.0.0 (stat: No such file or directory)
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201  mem       REG              0,121               13895904 /lib/libuuid.so.1.3.0 (stat: No such file or directory)
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201  mem       REG              0,121               11672957 /lib/libssl.so.3 (stat: No such file or directory)
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201  mem       REG              0,121               11672956 /lib/libcrypto.so.3 (stat: No such file or directory)
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201  mem       REG              0,121               11672959 /lib/libz.so.1.2.13 (stat: No such file or directory)
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201  mem       REG              0,121               11672951 /lib/ld-musl-x86_64.so.1 (stat: No such file or directory)
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201    0w      CHR                1,3         0t0          6 /dev/null
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201    1w     FIFO               0,13         0t0   62178474 pipe
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201    2w     FIFO               0,13         0t0   62178474 pipe
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201    3u     sock                0,8         0t0   62173481 protocol: UNIX
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201    4w     FIFO               0,13         0t0   62178474 pipe
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201    5w     FIFO               0,13         0t0   62178474 pipe
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201    6w      REG              0,121           0   14044307 /var/log/netdata/health.log
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201    7u     sock                0,8         0t0   62173487 protocol: TCP
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201    8u     sock                0,8         0t0   62173488 protocol: TCPv6
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201    9u  a_inode               0,14           0      12713 [eventpoll]
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201   10r     FIFO               0,13         0t0   62179402 pipe
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201   11w     FIFO               0,13         0t0   62179402 pipe
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201   12r     FIFO               0,13         0t0   62179403 pipe
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201   13w     FIFO               0,13         0t0   62179403 pipe
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201   14u  a_inode               0,14           0      12713 [eventfd]
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201   15r      CHR                1,3         0t0          6 /dev/null
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201   16u     sock                0,8         0t0   62179404 protocol: UNIX-STREAM
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201   17ur     REG              0,121     4587520   14044310 /var/cache/netdata/netdata-meta.db
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201   18u      REG              0,121     4906952   14044311 /var/cache/netdata/netdata-meta.db-wal
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201   19ur     REG              0,121       32768   14044312 /var/cache/netdata/netdata-meta.db-shm
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201   20ur     REG              0,121        4096   14044313 /var/cache/netdata/context-meta.db
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201   21u      REG              0,121       37112   14044314 /var/cache/netdata/context-meta.db-wal
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201   22ur     REG              0,121       32768   14044315 /var/cache/netdata/context-meta.db-shm
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201   23ur     REG              0,121     4587520   14044310 /var/cache/netdata/netdata-meta.db
lsof: no pwd entry for UID 201
netdata   1781992 1782298 REPLAY[1]              201   24u      REG              0,121     4906952   14044311 /var/cache/netdata/netdata-meta.db-wal

The pattern above (^) repeats approximately hundreds of thousands of times, until the node eventually becomes unavailable because it can no longer open any file descriptors.
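In the excerpt above, the same PID (1781992) appears with different thread IDs (1782297, 1782298), and the same FD numbers recur under each thread. One way to tell repeated rows from distinct descriptors is to count unique (PID, FD) pairs. The sketch below assumes the column layout shown in the excerpt (COMMAND, PID, TID, TASKCMD, USER, FD, ...); the function name `count_unique_fds` is hypothetical, and the result says nothing by itself about whether a leak occurred.

```python
import re
from collections import defaultdict


def count_unique_fds(lsof_lines):
    """Return (rows_per_pid, unique_fds_per_pid) for numeric FD entries.

    Assumes the lsof layout from the excerpt above:
    COMMAND PID TID TASKCMD USER FD TYPE DEVICE SIZE/OFF NODE NAME
    """
    rows = defaultdict(int)
    unique = defaultdict(set)
    for line in lsof_lines:
        # Skip lsof's own warnings, e.g. "lsof: no pwd entry for UID 201".
        if line.startswith("lsof:"):
            continue
        fields = line.split()
        if len(fields) < 6:
            continue
        pid, fd = fields[1], fields[5]
        # Keep only numeric descriptors; cwd, rtd, txt and mem are not FDs.
        match = re.match(r"\d+", fd)
        if match is None:
            continue
        rows[pid] += 1
        unique[pid].add(int(match.group()))
    return dict(rows), {pid: len(fds) for pid, fds in unique.items()}
```

If the row count is far larger than the unique count, most of the lines are the same descriptors listed once per thread rather than additional open files.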

@ilyam8
Member

ilyam8 commented Jul 17, 2023

Hey, @withinboredom.

approximately ^ 100's of thousands of times until the node

  • How did you come to this conclusion?
  • Can you show lsof output with a lot of open WAL/db files?
  • And can you show the "Applications -> disk -> apps.files" chart when there are a lot of open WAL/db files? Netdata tracks the number of open files for application groups.
[Screenshot, 2023-07-17 at 20 03 03]

@ktsaou
Member

ktsaou commented Jul 18, 2023

@withinboredom I am very sorry you had this bad experience with Netdata.

Please help us find the issue and fix it.

In the current nightly version of netdata we added two more monitoring functions, based on your suggestions at netdata/netdata#15411

  1. apps.plugin monitors the open file descriptors of all processes and raises alerts
  2. proc.plugin monitors the total file descriptors of the system and raises alerts.

But even before these changes, as @ilyam8 says, we were monitoring the file descriptors per application with apps.plugin. Keep in mind that apps.plugin reads the file descriptors of applications from /proc, so even for netdata processes it monitors them as understood by the kernel.
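For anyone who wants to cross-check the kernel's view by hand, the sketch below counts the entries under `/proc/<pid>/fd` for each process. It is an illustration only, not apps.plugin's implementation; the function names and the `proc_root` parameter are hypothetical, and reading other users' processes generally requires root.

```python
import os


def open_fd_count(pid, proc_root="/proc"):
    """Count entries in <proc_root>/<pid>/fd, or return None if unreadable."""
    try:
        return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))
    except OSError:
        # The process exited, or we lack permission to inspect it.
        return None


def top_fd_users(limit=5, proc_root="/proc"):
    """Return up to `limit` (fd_count, pid) pairs, highest count first."""
    counts = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # only numeric directory names are processes
        n = open_fd_count(name, proc_root)
        if n is not None:
            counts.append((n, int(name)))
    return sorted(counts, reverse=True)[:limit]
```

Running `top_fd_users()` on an affected node while the problem is occurring would show which process actually holds the most descriptors.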

From the output of lsof you posted above, we don't see any leaks in file descriptors.

So, although I understand you have removed Netdata from your systems, could you please help us trace the issue?

I have also read your blog post. You state that somehow a process managed to exhaust all file descriptors of the system and this led to a system-wide corruption. To my understanding this is not technically possible. Even if an application is leaking file descriptors, the limits of the process are far below the limits of the system. So, the process will start misbehaving but it cannot kill or corrupt the entire system. The ability of a process to corrupt the entire system would be a big issue for Linux systems.
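The two limits can be compared directly on a node. The sketch below reads the per-process limit with `resource.getrlimit(RLIMIT_NOFILE)` and the system-wide figures from `/proc/sys/fs/file-nr` (allocated handles, unused handles, maximum). The helper names are hypothetical, and how far apart the two numbers are depends on how the node and its containers are configured.

```python
import resource


def parse_file_nr(text):
    """Parse /proc/sys/fs/file-nr: 'allocated unused maximum'."""
    allocated, _unused, maximum = (int(x) for x in text.split())
    return allocated, maximum


def fd_limits(file_nr_path="/proc/sys/fs/file-nr"):
    """Return per-process and system-wide file descriptor limits (Linux)."""
    # Limits that apply to the current process.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    with open(file_nr_path) as f:
        allocated, maximum = parse_file_nr(f.read())
    return {
        "process_soft_limit": soft,
        "process_hard_limit": hard,
        "system_allocated": allocated,
        "system_maximum": maximum,
    }
```

Calling `fd_limits()` inside the netdata container and again on the host would show whether the container's limit is actually below the host's maximum.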

Anyway, please help us verify that Netdata is leaking file descriptors. If it does, we need to find where it does. You mention wal, but we don't see it in the lsof you posted.

Can you help us?

@withinboredom
Author

withinboredom commented Jul 18, 2023

I ended up installing netdata back, but directly on the nodes instead of using the helm chart. I really like netdata and couldn't find anything nearly as awesome. I lose some internal monitoring of the cluster, but that's ok with me for now.

From the output of lsof you posted above, we don't see any leaks in file descriptors.

It was essentially as I posted, hundreds and hundreds of thousands of lines of it (~500k). I probably should have gotten you the entire output, but didn't think of it at the time. As you can see from the output, it is entirely netdata that has these open files. This isn't filtered for netdata, it's raw output of lsof on the node.

At first, I thought it ran out of space on the volume, but post-mortem, the volume only had about 20 MB in use (out of 1 GB). I have no idea why netdata did this.

You mention wal, but we don't see it in the lsof you posted.

You may have to scroll to the right to read the end of the lines (github doesn't wrap code blocks, apparently).

Can you help us?

Yeah, absolutely. With this new monitoring, I'd feel much safer installing the helm chart again.

I dug through the node's logs. Here are some things I saw before it ran out of file descriptors:

Jul 04 20:55:44 cameo kernel: netdata[20554]: segfault at 30 ip 00007f19a348a2f9 sp 00007f19a0fab510 error 4 in ld-musl-x86_64.so.1[7f19a3447000+4c000]

There are many log entries like that before the system fails from running out of FDs. Does a segfault leave open files behind if it happens in a container? I know PID 1 is normally responsible for reaping processes, but I don't know the structure of these containers (if there isn't a PID 1 in the container, that could be it ... 🤔).

Eventually, it looks like etcd loses access to FDs first (well, "first" might be that it is just the most active process in the node), followed by netdata attempting to ask k8s for a configmap repeatedly. Then, k8s itself. Eventually, there are enough processes stuck with too many FDs that it overwhelms the system.

At the time, there were only a handful of containers/processes on this node.

I'll see if I can get the full logs here.
