[Filebeat] Journald causes Filebeat to crash #34077

Open · Tracked by #37086
Aqualie opened this issue Dec 20, 2022 · 22 comments · May be fixed by #39605
Labels: bug · Team:Elastic-Agent · Team:Elastic-Agent-Data-Plane

Comments

@Aqualie

Aqualie commented Dec 20, 2022

  • Version: 8.5.3
  • Operating System: 6.0.12-arch1-1
  • Steps to Reproduce: Enable the IPTables (input: journald) plugin in Elastic Agent, wait a few minutes, and it crashes
    Crash logs: filebeat.log
@botelastic botelastic bot added the needs_team label Dec 20, 2022
@dlouzan

dlouzan commented Jan 24, 2023

We at Siemens are also experiencing exactly this error and have tracked it to the journald file rotation. In our logs, we can correlate the two events:

$ cat /etc/fedora-release
Fedora release 36 (Thirty Six)

$ dnf list --installed filebeat
Installed Packages
filebeat.x86_64                                         8.5.1-1                                          @@commandline

$ journalctl -u filebeat -u systemd-journald

...

systemd-journald[538]: Field hash table of /var/log/journal/ec2a83e928b0ba8fda4772c954d0d37d/system.journal has a fill level at 78.4 (261 of 333 items), suggesting rotation.
systemd-journald[538]: /var/log/journal/ec2a83e928b0ba8fda4772c954d0d37d/system.journal: Journal header limits reached or header out-of-date, rotating.

...

filebeat[1323847]: fatal error: unexpected signal during runtime execution
filebeat[1323847]: [signal SIGBUS: bus error code=0x2 addr=0x79572ec00098 pc=0x795778ce39c4]
filebeat[1323847]: runtime stack:
filebeat[1323847]: runtime.throw({0x61cde86dd0cb?, 0x760d102f0c38?})
filebeat[1323847]:         /usr/local/go/src/runtime/panic.go:992 +0x71
filebeat[1323847]: runtime.sigpanic()
filebeat[1323847]:         /usr/local/go/src/runtime/signal_unix.go:802 +0x225
filebeat[1323847]: goroutine 30 [syscall]:
filebeat[1323847]: runtime.cgocall(0x61cde8683bf0, 0xc000cc1058)
filebeat[1323847]:         /usr/local/go/src/runtime/cgocall.go:157 +0x5c fp=0xc000cc1010 sp=0xc000cc0fd8 pc=0x61cde5f6045c
filebeat[1323847]: github.com/coreos/go-systemd/v22/sdjournal._Cfunc_my_sd_journal_next(0x760d102f1900, 0x760cec005530)
filebeat[1323847]:         _cgo_gotypes.go:351 +0x50 fp=0xc000cc1058 sp=0xc000cc1010 pc=0x61cde8660e30
filebeat[1323847]: github.com/coreos/go-systemd/v22/sdjournal.(*Journal).Next.func1(0x760d102f1900, 0xc000cb0080)
filebeat[1323847]:         /go/pkg/mod/github.com/coreos/go-systemd/v22@v22.3.2/sdjournal/journal.go:579 +0xa8 fp=0xc000cc10c0 sp=0xc000cc1058 pc=0x61cde86646e8
filebeat[1323847]: github.com/coreos/go-systemd/v22/sdjournal.(*Journal).Next(0xc000cb0080)
filebeat[1323847]:         /go/pkg/mod/github.com/coreos/go-systemd/v22@v22.3.2/sdjournal/journal.go:579 +0x10b fp=0xc000cc11b8 sp=0xc000cc10c0 pc=0x61cde86644ab
filebeat[1323847]: github.com/elastic/beats/v7/filebeat/input/journald/pkg/journalread.(*Reader).Next(0xc000be63c0, {0x760d12f73708, 0xc0004467c0})

...

The Filebeat crash does not always happen when journald rotation is triggered; rotation is not a sufficient condition, but it appears to be a necessary one. On busy hosts where journald rotates faster, the correlation is almost 100%.

I have also found this related issue in systemd: systemd/systemd#24320. So it's unclear to me at the moment whether this is Filebeat not properly handling the SIGBUS signal that can be delivered through the go-systemd library, the library itself misbehaving, or a fundamental issue with how systemd works.

@dlouzan

dlouzan commented Feb 10, 2023

Bumping the issue in the hope of a reaction from the maintainers, or at least a quick assessment; this is literally crashing dozens of times per day on busy hosts 🙇

I did not find a policy for tagging/mentioning; @ph, could you perhaps help with an assessment?

Thanks in advance for any input!

@utilitynerd

I'm seeing the same issue on Debian 12 (not technically supported yet) and Filebeat 8.10.1.

@dlouzan

dlouzan commented Nov 17, 2023

Checking upstream, in theory systemd/systemd#29456 solves this, but I do not even see it included in systemd v255-rc2, so it will probably be a while before we can verify it, or know whether they'll backport it to previous releases.

It'd be great if somebody who runs a cutting-edge setup could confirm 😇

@belimawr belimawr self-assigned this Mar 25, 2024
@belimawr
Contributor

I've been trying to reproduce this issue today and I can't get it to happen.

Following the linked issues, I ended up using systemd/systemd#24320 (comment) to try reproducing it.

I've tried two distros so far:

  • Archlinux, journald/systemd 255 (255.4-2-arch)
  • Ubuntu 20.04, journald/systemd 245 (245.4-4ubuntu3.23)

I'll look more into it, probably trying with Fedora as well.

@dlouzan

dlouzan commented Mar 25, 2024

@belimawr We get the crashes regularly (I verified our logs again just now) on Fedora 38 & Amazon Linux 2023 hosts. We only see it on busy hosts, which makes sense as this is some kind of race condition.

@belimawr
Contributor

belimawr commented Mar 25, 2024

Thanks for the quick reply @dlouzan ! I'll try those distros and see if I can reproduce it.

Do you have any idea of the throughput of messages in the journald logs? Currently I'm working with about 20k ~ 30k events per minute on the systems I mentioned before.

Which version of Filebeat are you currently using?

@dlouzan

dlouzan commented Mar 25, 2024

@belimawr Both kinds of hosts are using the latest stable dnf package, 8.12.2-1; at the time of the last crash I can see, the node was delivering ~10k events from 11 log files every 30s. There is not much of a difference in the delivered stats after the restart that followed the crash:

{
    "message": "Non-zero metrics in the last 30s",
    "service.name": "filebeat",
    "monitoring": {
        "metrics": {
            "beat": {
                "cgroup": {
                    "memory": {
                        "mem": {
                            "usage": {
                                "bytes": 91836416
                            }
                        }
                    }
                },
...
                "handles": {
                    "limit": {
                        "hard": 65535,
                        "soft": 65535
                    },
                    "open": 116
                },
                "info": {
                    "ephemeral_id": "1fe23dc6-4d58-47c3-8c77-fdfdaaa8c143",
                    "uptime": {
                        "ms": 6810097
                    },
                    "version": "8.12.2"
                },
...
            },
            "filebeat": {
                "events": {
                    "active": 775,
                    "added": 10007,
                    "done": 9601
                },
                "harvester": {
                    "open_files": 11,
                    "running": 11
                }
            },
            "libbeat": {
...
                "output": {
                    "events": {
                        "acked": 9600,
                        "active": 0,
                        "batches": 6,
                        "total": 9600
                    },
                    "read": {
                        "bytes": 440
                    },
                    "write": {
                        "bytes": 1841900
                    }
                },
                "pipeline": {
                    "clients": 17,
                    "events": {
                        "active": 775,
                        "filtered": 1,
                        "published": 10006,
                        "total": 10007
                    },
                    "queue": {
                        "acked": 9600
                    }
                }
            },
            "registrar": {
                "states": {
                    "current": 20,
                    "update": 8524
                },
                "writes": {
                    "success": 6,
                    "total": 6
                }
            },
            "system": {
                "load": {
                    "1": 5.98,
                    "15": 5.45,
                    "5": 5.33,
                    "norm": {
                        "1": 0.3738,
                        "15": 0.3406,
                        "5": 0.3331
                    }
                }
            }
        },
        "ecs.version": "1.6.0"
    }
}

@belimawr
Contributor

Thanks!

@belimawr
Contributor

I can confirm that I can reproduce the crash, and the time I noticed it happening was when journald was rotating its logs.

Interestingly enough, journalctl --rotate does not trigger it. There might be something other than just a rotation playing a role here.

Versions (IP redacted):

[root@ip-42.42.42.42 ec2-user]# filebeat  version
filebeat version 8.12.2 (amd64), libbeat 8.12.2 [0b71acf2d6b4cb6617bff980ed6caf0477905efa built 2024-02-15 13:39:15 +0000 UTC]
[root@ip-42.42.42.42 ec2-user]# uname -a
Linux ip-42.42.42.42.eu-central-1.compute.internal 6.1.79-99.167.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 12 18:15:55 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
[root@ip-42.42.42.42 ec2-user]# journalctl --version
systemd 252 (252.16-1.amzn2023.0.2)
+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP -GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 -BZIP2 -LZ4 +XZ -ZLIB -ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified
[root@ip-42.42.42.42 ec2-user]# 

How to reproduce

  1. Get an Amazon 2023 Linux host

  2. Install Filebeat using yum as described in our documentation

  3. Use the following configuration (/etc/filebeat/filebeat.yml)

    /etc/filebeat/filebeat.yml

    filebeat.inputs:
    - type: journald
      id: journald-input
    
    output.elasticsearch:
      hosts:
        - https://my-awesome-deployment.elastic-cloud.com:443
      username: elastic
      password: a-very-secret-password

  4. Start the service: systemctl enable --now filebeat

  5. Start watching Filebeat logs: journalctl -fu filebeat

  6. Generate a high load of logs sent to journald so that rotation is triggered often. The following script is from the "Journal-signal SIGBUS error" issue, systemd/systemd#24320 (comment). I ran it as root.

    crash.py

    #!/usr/bin/env python3
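    # Flood syslog (and therefore the system journal) as fast as possible to force frequent journald rotation.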
    
    import logging
    import logging.handlers
    import random
    import string
    
    my_logger = logging.getLogger('MyLogger')
    my_logger.setLevel(logging.DEBUG)
    
    handler = logging.handlers.SysLogHandler(address = '/dev/log')
    
    my_logger.addHandler(handler)
    
    animals = ["cat", "dog", "crab"]
    padding = ''.join(random.choices(string.ascii_lowercase, k=1024))
    
    while True:
        for animal in animals:
            my_logger.critical(f"Rubbish - {animal} food is running short. Rubbish: " + padding)

  7. Wait until the crash happens. In my tests it seemed to only happen after the maximum number of rotated files was reached; once that was reached, every rotation triggered the crash. (A note on speeding this up follows after this list.)
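
A minimal sketch for speeding this up (not part of the original steps; the drop-in path and values are illustrative assumptions): shrinking journald's size limits makes rotation, and the rotated-file cap, get hit much sooner on a test VM.

# /etc/systemd/journald.conf.d/tiny-rotation.conf (hypothetical drop-in, illustrative values only)
[Journal]
SystemMaxFileSize=8M
SystemMaxFiles=5

Then restart the daemon with systemctl restart systemd-journald so the new limits take effect.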

@belimawr
Contributor

I've also tried to reproduce it in a VM running Archlinux, and the crash does not occur there; it uses a newer version of journald/systemd (255):

systemd 255 (255.4-2-arch)
+PAM +AUDIT -SELINUX -APPARMOR -IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP -SYSVINIT default-hierarchy=unified

It really looks like the crash is not caused by Filebeat, but by Journald/go-systemd.

@dlouzan

dlouzan commented Mar 28, 2024

@belimawr Perhaps the efforts should go into supporting the backport of the supposed fix into systemd v252, which is the stable version in multiple distributions: systemd/systemd-stable#356 🙇

@belimawr
Contributor

belimawr commented May 1, 2024

I was looking at this issue again in a more structured way, and this time I can confirm that the crash is not related to Filebeat; even a Filebeat build with a newer version of journald still experiences the same crash.

I also experienced the crashes described by:

They happen intermittently with the SIGBUS error, all while flooding journald with logs and thus forcing quick log rotation.

@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@belimawr
Contributor

belimawr commented May 6, 2024

I've tried backporting systemd/systemd-stable#356 to v252, but just the patch from the original PR is not enough because other parts of the code have diverged. I tried to fix some obvious ones, but my C is pretty rusty and my knowledge of the systemd codebase is zero :/

Anyway, here is my attempt: https://github.com/belimawr/systemd-stable/tree/v252-stable

@belimawr
Contributor

belimawr commented May 6, 2024

The Amazon Linux issue about this also seems pretty stale. I added a comment there, but I don't have high hopes: amazonlinux/amazon-linux-2023#608 (comment)

@belimawr
Contributor

I did some investigation trying to recover from the panic caused by systemd and, unfortunately, it's not possible to recover from it :/

When a SIGBUS is raised due to an error in program execution, the Go runtime converts it into a run-time panic that we cannot recover from in our code.

From the Go docs:

Synchronous signals are signals triggered by errors in program execution: SIGBUS, SIGFPE, and SIGSEGV. These are only considered synchronous when caused by program execution, not when sent using os.Process.Kill or the kill program or some similar mechanism. In general, except as discussed below, Go programs will convert a synchronous signal into a run-time panic.
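
To make that concrete, here is a minimal sketch of the kind of recovery attempt described above. This is my own illustration, not Filebeat code, and it assumes the go-systemd sdjournal API that appears in the stack trace earlier in this thread. It compiles, but the deferred recover() never fires for this crash:

package main

import (
	"fmt"

	"github.com/coreos/go-systemd/v22/sdjournal"
)

// readNext wraps a single journal read in a recover(). This only catches
// Go-level panics; the SIGBUS reported in this issue is raised while the
// C function sd_journal_next is executing, and the Go runtime turns that
// into a fatal "unexpected signal during runtime execution" that aborts
// the process before any deferred recover can run.
func readNext(j *sdjournal.Journal) (n uint64, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic while reading journal: %v", r)
		}
	}()
	return j.Next()
}

func main() {
	j, err := sdjournal.NewJournal()
	if err != nil {
		panic(err)
	}
	defer j.Close()

	for {
		n, err := readNext(j)
		if err != nil {
			fmt.Println("journal read error:", err)
			return
		}
		if n == 0 {
			// Reached the end of the journal; a real reader would block on j.Wait().
			return
		}
	}
}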

@belimawr
Contributor

@cmacknz, @jlind23 I think we should close this task as there isn't anything else we can do on our side. At least on a code level.

We could add something about compatible versions of systemd in our documentation, but aside from it I can't see anything else we could do at the moment.

@jlind23
Collaborator

jlind23 commented May 13, 2024

@belimawr I am fine closing it with a "won't fix" status then.

@cmacknz
Member

cmacknz commented May 13, 2024

We could add something about compatible versions of systemd in our documentation, but aside from it I can't see anything else we could do at the moment.

Yes, we need something like this, otherwise this error will lead to support cases for us. I mean, it already effectively is one in the issue tracker, and it isn't GA yet.

Can we detect the systemd version at runtime and refuse to run with a detailed error if it is the version with this bug? That seems preferable to letting us be killed by SIGBUS.

@belimawr
Contributor

Can we detect the systemd version at runtime and refuse to run with a detailed error if it is the version with this bug? That seems preferable to letting us be killed by SIGBUS.

I'll look into that. Worst case scenario we can exec.Command("journalctl", "--version") and parse the output.
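
A rough sketch of that fallback follows. This is my own illustration, not the implementation in #39605, and the cut-off of 255 is only an example based on the upstream fix discussed earlier; the real affected-version check may differ.

package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// journaldVersion shells out to `journalctl --version` and parses the major
// version from the first output line, which looks like:
//   systemd 252 (252.16-1.amzn2023.0.2)
func journaldVersion() (int, error) {
	out, err := exec.Command("journalctl", "--version").Output()
	if err != nil {
		return 0, fmt.Errorf("running journalctl --version: %w", err)
	}
	firstLine := strings.SplitN(string(out), "\n", 2)[0]
	fields := strings.Fields(firstLine)
	if len(fields) < 2 {
		return 0, fmt.Errorf("unexpected journalctl --version output: %q", firstLine)
	}
	return strconv.Atoi(fields[1])
}

func main() {
	v, err := journaldVersion()
	if err != nil {
		fmt.Println("could not determine systemd version:", err)
		return
	}
	// 255 is an illustrative cut-off only; see the linked upstream issues
	// for the versions actually carrying the rotation SIGBUS bug.
	if v < 255 {
		fmt.Printf("systemd %d may crash the journald input during log rotation; refusing to start it\n", v)
		return
	}
	fmt.Printf("systemd %d detected, journald input enabled\n", v)
}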

@belimawr belimawr linked a pull request May 16, 2024 that will close this issue
@belimawr
Contributor

PR adding validation of the systemd version to prevent Filebeat from crashing: #39605
