[Filebeat] Journald causes Filebeat to crash #34077
We at Siemens are also experiencing exactly this error and have tracked it to the journald file rotation. In our logs, we can correlate the two events:
The Filebeat crash does not always happen when journald rotation is triggered; the rotation is not a sufficient condition, but it is a necessary one. On busy hosts where journald rotates faster, the correlation is almost 100%. I have also found this related issue in systemd: systemd/systemd#24320. So it's unclear to me at the moment whether this is Filebeat not properly handling the expected …
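The correlation step described above can be sketched in shell. This is a hedged example: the crash timestamp is a hypothetical placeholder (copy it by hand from Filebeat's own log), and the one-minute window is an assumption, not something stated in the thread.

```shell
# Hypothetical crash timestamp, taken manually from Filebeat's log.
CRASH_TS="2024-03-01 14:23:11"

# One-minute window around the crash (assumes GNU date).
window_start() { date -d "$1 1 minute ago" '+%Y-%m-%d %H:%M:%S'; }
window_end()   { date -d "$1 1 minute"     '+%Y-%m-%d %H:%M:%S'; }

if command -v journalctl >/dev/null 2>&1; then
  # What did systemd-journald itself log around the crash? Rotation and
  # vacuuming messages landing in this window is the correlation we saw.
  journalctl -u systemd-journald \
    --since "$(window_start "$CRASH_TS")" \
    --until "$(window_end "$CRASH_TS")"
fi
```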
Bumping the issue in the hope of a reaction from the maintainers, or a quick assessment; this is literally crashing dozens of times per day on busy hosts 🙇 I did not find a policy for tagging / mentioning; @ph, could you perhaps help with an assessment? Thanks in advance for any input!
I'm seeing the same issue on Debian 12 (not technically supported yet) and Filebeat 8.10.1.
Checking upstream, in theory systemd/systemd#29456 solves this, but I do not even see it added to systemd v255-rc2, so it will probably be a while until we can verify this, or find out whether they'll backport it to previous releases. It'd be great if somebody who runs a cutting-edge setup could confirm 😇
I've been trying to reproduce this issue today and I can't get it to happen. Following the linked issues, I ended up using systemd/systemd#24320 (comment) to try reproducing it. I've tried two distros so far:
I'll look more into it, probably trying with Fedora as well.
@belimawr We get the crashes regularly (verified our logs just now again) on Fedora 38 & Amazon Linux 2023 hosts. We only see it on busy hosts, which makes sense as this appears to be some kind of race condition.
Thanks for the quick reply @dlouzan ! I'll try those distros and see if I can reproduce it. Do you have any idea of the throughput of messages in the journald logs? Currently I'm working with about 20k ~ 30k events per minute on the systems I mentioned before. Which version of Filebeat are you currently using? |
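To answer the throughput question, a rough per-minute figure can be pulled straight from journald. This is a sketch; `journald_epm` is a hypothetical helper name, and the count is approximate since messages keep arriving while we count.

```shell
# Count journal entries written in the last minute as a rough
# events-per-minute figure. Assumes journalctl is on PATH; if it is
# not (e.g. in a container), report 0 instead of failing.
journald_epm() {
  if command -v journalctl >/dev/null 2>&1; then
    journalctl --since "-1min" --no-pager -q -o cat | wc -l
  else
    echo 0
  fi
}

echo "journald events in the last minute: $(journald_epm)"
```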
@belimawr Both kinds of hosts are using the latest stable dnf packages. {
"message": "Non-zero metrics in the last 30s",
"service.name": "filebeat",
"monitoring": {
"metrics": {
"beat": {
"cgroup": {
"memory": {
"mem": {
"usage": {
"bytes": 91836416
}
}
}
},
...
"handles": {
"limit": {
"hard": 65535,
"soft": 65535
},
"open": 116
},
"info": {
"ephemeral_id": "1fe23dc6-4d58-47c3-8c77-fdfdaaa8c143",
"uptime": {
"ms": 6810097
},
"version": "8.12.2"
},
...
},
"filebeat": {
"events": {
"active": 775,
"added": 10007,
"done": 9601
},
"harvester": {
"open_files": 11,
"running": 11
}
},
"libbeat": {
...
"output": {
"events": {
"acked": 9600,
"active": 0,
"batches": 6,
"total": 9600
},
"read": {
"bytes": 440
},
"write": {
"bytes": 1841900
}
},
"pipeline": {
"clients": 17,
"events": {
"active": 775,
"filtered": 1,
"published": 10006,
"total": 10007
},
"queue": {
"acked": 9600
}
}
},
"registrar": {
"states": {
"current": 20,
"update": 8524
},
"writes": {
"success": 6,
"total": 6
}
},
"system": {
"load": {
"1": 5.98,
"15": 5.45,
"5": 5.33,
"norm": {
"1": 0.3738,
"15": 0.3406,
"5": 0.3331
}
}
}
},
"ecs.version": "1.6.0"
}
}
Thanks! |
I can confirm I can reproduce the crash, and the time I noticed it happening was when journald was rotating its logs. Interestingly enough, … Versions (IP redacted):
How to reproduce
I've also tried to reproduce it in a VM running Arch Linux, and the crash does not happen there; it uses a newer version of journald/systemd (255):
It really looks like the crash is not caused by Filebeat, but by Journald/go-systemd.
@belimawr Perhaps the efforts should go into supporting the backport of the supposed fix into systemd v252, which is the stable version in multiple distributions: systemd/systemd-stable#356 🙇 |
I was looking at this issue again in a more structured way, and this time I can confirm that the crash is not related to Filebeat: even a Filebeat build with a newer version of go-systemd still experiences the same crash. I also experienced the crashes described by:
They happen intermittently with the SIGBUS error. All while flooding Journald with logs, thus forcing a quick log rotation. |
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
I've tried backporting systemd/systemd-stable#356 to … Anyway, here is my attempt: https://github.com/belimawr/systemd-stable/tree/v252-stable
The AmazonLinux issue about it also seems pretty stale. I added a comment here but I don't have high hopes. amazonlinux/amazon-linux-2023#608 (comment) |
I did some investigation trying to recover from the panic caused by systemd and, unfortunately, it's not possible to recover from it :/ When a SIGBUS is sent due to an error in program execution, the Go runtime converts it into a run-time panic that we cannot recover from in our code. From the Go docs:
@belimawr I am fine closing it with a "won't fix" status then. |
Yes, we need something like this, otherwise this error will lead to support cases for us. I mean, it effectively already has in the issue tracker, and it isn't GA yet. Can we detect the systemd version at runtime and refuse to run, with a detailed error, if it is a version with this bug? That seems preferable to letting us be killed by SIGBUS.
I'll look into that. Worst-case scenario, we can …
PR adding validation to the Systemd version to prevent Filebeat from crashing: #39605 |
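As a sketch of the idea: the real validation in that PR lives inside Filebeat's Go code, so this hedged shell version only illustrates the shape of the check. The threshold of 255 is an assumption taken from the "does not reproduce on systemd 255" observation above, and may differ from what the PR actually uses.

```shell
# Refuse to start the journald input on systemd versions known to
# trigger the SIGBUS crash. MIN_SYSTEMD=255 is an assumed threshold.
MIN_SYSTEMD=255

systemd_version() {
  # First line of `systemctl --version` looks like: "systemd 252 (252.4)".
  # The command to run is passed in so it can be faked for testing.
  "$@" --version | awk 'NR==1 {print $2}'
}

check_systemd() {
  v="$(systemd_version "$@")"
  case "$v" in
    ''|*[!0-9]*) echo "cannot parse systemd version: '$v'" >&2; return 1 ;;
  esac
  if [ "$v" -lt "$MIN_SYSTEMD" ]; then
    echo "refusing journald input: systemd $v < $MIN_SYSTEMD (see #34077)" >&2
    return 1
  fi
}
```

Usage would be something like `check_systemd systemctl || exit 1` before starting the input.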
Crash logs: filebeat.log