Linux ‐ Check auditd backlog limit usage

Description

This is applicable to any Linux host using auditd.

An issue has been reported with auditd in which nodes can freeze when the backlog_limit is exceeded, causing either of the following:

Higher CPU usage and CPU time, causing general slowness on the nodes and even preventing users from connecting over SSH or the console.
Kernel panic, which can be configured through Security Hardening parameters ("Operating system action on auditd processing failure" - "RHEL-07-030090"). A kernel panic is triggered if auditd goes down for any reason (including backlog_limit being exceeded) when this flag is set to TRUE or the "-f 2" flag is set in auditd (this can be checked with "auditctl -s", which reports it in the "failure" field). This setting sacrifices availability for the sake of security: if auditd fails, someone could execute actions without them being logged.

Impact

Describe the potential risks or consequences if the detected issues are not resolved.

Root Cause

Outline common scenarios that may lead to the failure or alert.

Diagnostics

This rule examines the current usage of the auditd backlog with the following command:

[root@edge-20 ~]# auditctl -s
enabled 1
failure 1
pid 823
rate_limit 0
backlog_limit 10000
lost 5318202
backlog 0
loginuid_immutable 0 unlocked
[root@edge-20 ~]#

The rule checks two things across 10 consecutive queries separated by 0.25 s:

It compares the "backlog" value with the "backlog_limit" value; if the backlog is 80% or more of the limit, the test fails. (The highest "backlog" seen across the queries is used.)
It also checks the "lost" counter and how much it increased over the sampling period; if it increased, the test also fails.
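
The two checks above can be sketched in shell as follows. This is only an illustration of the described logic, not the rule's actual implementation: the function names (`audit_field`, `evaluate_backlog`, `sample_auditd`) are hypothetical, the parsing assumes the one-field-per-line `auditctl -s` output shown above, and skipping the ratio check when backlog_limit is 0 is an assumption made here to avoid a meaningless comparison.

```shell
# Hypothetical sketch of the check described above; not the rule's real code.

# Print the value of one field (e.g. "backlog") from `auditctl -s` output on stdin.
# Assumes the "name value" per-line format shown in the sample output.
audit_field() {
    awk -v key="$1" '$1 == key { print $2; exit }'
}

# Arguments: backlog_limit, first "lost" sample, last "lost" sample, highest backlog.
# Prints PASS, or FAIL with the reason.
evaluate_backlog() {
    local limit=$1 first_lost=$2 last_lost=$3 max_backlog=$4
    # Assumption: a limit of 0 is skipped rather than treated as exceeded.
    if [ "$limit" -gt 0 ] && [ $(( max_backlog * 100 )) -ge $(( limit * 80 )) ]; then
        echo "FAIL: backlog $max_backlog is 80% or more of backlog_limit $limit"
    elif [ "$last_lost" -gt "$first_lost" ]; then
        echo "FAIL: lost increased from $first_lost to $last_lost"
    else
        echo "PASS"
    fi
}

# Take 10 samples 0.25 s apart and evaluate them (needs root for auditctl).
sample_auditd() {
    local i out backlog lost limit=0 max_backlog=0 first_lost="" last_lost=0
    for i in 1 2 3 4 5 6 7 8 9 10; do
        out=$(auditctl -s) || return 2
        limit=$(printf '%s\n' "$out" | audit_field backlog_limit)
        backlog=$(printf '%s\n' "$out" | audit_field backlog)
        lost=$(printf '%s\n' "$out" | audit_field lost)
        # Keep the highest backlog seen and the first/last lost counters.
        if [ "$backlog" -gt "$max_backlog" ]; then max_backlog=$backlog; fi
        if [ -z "$first_lost" ]; then first_lost=$lost; fi
        last_lost=$lost
        sleep 0.25
    done
    evaluate_backlog "$limit" "$first_lost" "$last_lost" "$max_backlog"
}
```

After sourcing the functions as root, running `sample_auditd` prints a single PASS or FAIL line. Keeping the decision in `evaluate_backlog` separate from the sampling makes it possible to try the thresholds with made-up numbers without touching auditd.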

Solution

High utilization of the backlog_limit buffer might be a symptom of a problem on the node, e.g. something might be failing and flooding the system with audit messages. That is what should be investigated, since it might cause the node to hang.

# To see the state of the backlog buffer and rotating audit logs:
cat /var/log/messages
journalctl -f -u auditd
auditctl -s
 
# To check which types of reports have an increased count:
aureport --start today
 
# To see the reports:
cat /var/log/audit/audit.log

As a start, you can check the messages logs to determine which nodes are affected, and then compare audit.log between affected and unaffected nodes to understand where the additional auditd entries are coming from.
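
One way to make that comparison easier is to count records per type on each node and compare the totals. The helper below is a minimal sketch under the assumption that each audit.log record carries a `type=NAME` field (optionally preceded by other fields such as `node=`); the function name `count_audit_types` is hypothetical.

```shell
# Hypothetical helper: reads audit.log lines on stdin and prints
# "count type" pairs, most frequent record type first.
count_audit_types() {
    awk '{
        # Use the first field that starts with "type=" as the record type.
        for (i = 1; i <= NF; i++)
            if ($i ~ /^type=/) { sub(/^type=/, "", $i); n[$i]++; break }
    }
    END { for (t in n) print n[t], t }' | sort -k1,1nr -k2,2
}
```

For example, `count_audit_types < /var/log/audit/audit.log | head` can be run on an affected and an unaffected node; a record type whose count is much higher on the affected node is a reasonable place to start looking.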

If the node has already hung, you will need to troubleshoot that specific scenario (rebooting the node through IPMI/BMC will probably help to regain access to it).

In some cases where the system is already running slowly because of audit messages, the command can time out (30 seconds for each of the 10 queries). In that case it is worth reviewing the auditd messages.
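
When reproducing the query by hand on a slow node, the same 30-second limit can be approximated with the coreutils `timeout` command, which exits with status 124 when the limit is hit. The wrapper name `run_with_timeout` is hypothetical, and this is a sketch rather than the rule's own mechanism.

```shell
# Hypothetical wrapper: run a command with a time limit in seconds.
# Returns the command's exit status, or 124 if `timeout` had to stop it.
run_with_timeout() {
    local secs=$1 rc
    shift
    timeout "$secs" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "TIMEOUT after ${secs}s: $*" >&2
    fi
    return "$rc"
}
```

For example, `run_with_timeout 30 auditctl -s` either prints the status or reports the timeout on stderr, which is a hint that the node is already struggling with audit load.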
