clamd becomes zombie process after some time #330
Comments

---

The same here in a GKE environment.

---

Also running into this issue in EKS.

---

The same happens to me. Both happen in the container.

---

I experienced the same in a Kubernetes cluster. To debug this, I started the pod with a modified command so I could inspect it. It shows that the process received a SIGKILL signal (because it exceeded its memory limits, in my case). But there seems to be a bug in the Docker image here: in an OOM situation, the container should be killed (and restarted), but instead the clamd process is left behind as a zombie. I don't know why. For the meantime, I configured a LivenessProbe in Kubernetes using the clamdcheck.sh script.
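
(For reference, a minimal sketch of how one might confirm the OOM kill and the zombie state from outside; the pod name `clamav-0` is a placeholder:)

```sh
# Sketch only; "clamav-0" is a placeholder pod name.
# Check whether Kubernetes recorded an OOM kill for the container:
kubectl describe pod clamav-0 | grep -A 3 "Last State"

# Look for processes in state "Z" (zombie) inside the running container:
kubectl exec clamav-0 -- ps -o pid,stat,comm
```
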

---

@jacobrayschwartz @denis111 @ysmall-backbase: Per the comment above, it may just be that you need to allocate more RAM for your containers. Either that, or disable the health check.
@micahsnyder there is no health check issue for me, it works fine :D. Kubernetes does not support Docker health checks, so one has to define their own in Kubernetes, but this is OK:

```yaml
livenessProbe:
  exec:
    command:
      - clamdcheck.sh
  initialDelaySeconds: 120
  failureThreshold: 2
  periodSeconds: 30
```

But I still see an issue here: if the clamd process dies, the container keeps running in an unusable state. Health checks help to detect and handle the situation. Anyway, in this case they just hide the underlying root cause that the container becomes unusable.

---

I've seen a similar issue at startup time when the virus database is out of date (>7 days old) with AKS.
If I don't use a volume for /var/lib/clamav, I get the same issue reliably on each start, because the signatures in the current 0.104 image are older than 7 days.

---

In #563 I fixed it for myself by allowing freshclam to finish fetching updates before starting clamd.

---

I'm not sure if this is the right solution. What this bug sounds like to me is that ClamD is becoming a zombie process when it runs out of memory because both it and Freshclam have loaded the databases. I don't know why ClamD becomes a zombie instead of being killed, or why the container isn't killed.

@andremae's comment about the 0.104 image's databases being older than 7 days is a decent reason to merge @TairaSayo's PR anyway, so it can update before ClamD starts -- but running with old databases wouldn't cause ClamD to crash or give up or anything. It's fine to load older databases. ClamD would just reload later when the self-check happens and it sees that Freshclam updated the databases.

One reason I'm hesitant to merge the PR is that ClamD already takes a while to start up. Having it wait for a hard-coded 60 seconds before starting ClamD would mean a much longer start-up time. Maybe we could have it run freshclam twice instead. This way ClamD will only wait as long as it takes for freshclam to finish updating. The second, daemonized freshclam will do the DNS query to check for an update, find that it is up-to-date, and then sleep until it's time to check again. I think that's fine.

But even with this change, I think we'll still run into this bug later, when freshclam does an update while clamd is running and clamd reloads. There are two options to lower the RAM usage to resolve this (see the config sketch below):

1. Set `TestDatabases no` in freshclam.conf, so freshclam doesn't load-test the databases after download.
2. Set `ConcurrentDatabaseReload no` in clamd.conf, so ClamD doesn't keep two copies of the databases in memory during a reload.

The second one is more intrusive because it means ClamD will block for 20-60 seconds (depending on how fast the host is) while it reloads, once a day. But the concurrent reload uses more RAM than freshclam's database load-testing, so it is arguably the more important change if you don't have a lot of RAM for your containers. Thoughts?
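
(A minimal sketch of applying those two options; both are real clamd.conf/freshclam.conf settings, but the paths below assume the stock image layout and that the options aren't already set:)

```sh
# Sketch: apply the two RAM-saving options to existing configs.
# Paths assume the stock image layout; adjust for your setup.
echo "ConcurrentDatabaseReload no" >> /etc/clamav/clamd.conf
echo "TestDatabases no" >> /etc/clamav/freshclam.conf
```
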

---

@micahsnyder you can control the ClamD delay with the CLAMD_STARTUP_DELAY variable, but I agree that it could be easily missed. Do you think it would be better to lower the default value to 6-10 seconds?

---

@TairaSayo sorry about the lag replying to you. Instead of having a startup delay, what I'm proposing is this change, so you run freshclam twice. The first time is non-daemonized and foreground (blocking). The second time is like normal.

```sh
if [ "${CLAMAV_NO_FRESHCLAMD:-false}" != "true" ]; then
    echo "Updating databases before starting ClamD"
    freshclam \
        --checks="${FRESHCLAM_CHECKS:-1}" \
        --foreground \
        --stdout \
        --user="clamav"
fi

if [ "${CLAMAV_NO_CLAMD:-false}" != "true" ]; then
    echo "Starting ClamD"
    if [ -S "/run/clamav/clamd.sock" ]; then
        unlink "/run/clamav/clamd.sock"
    fi
    clamd --foreground &
    while [ ! -S "/run/clamav/clamd.sock" ]; do
        if [ "${_timeout:=0}" -gt "${CLAMD_STARTUP_TIMEOUT:=1800}" ]; then
            echo
            echo "Failed to start clamd"
            exit 1
        fi
        printf "\r%s" "Socket for clamd not found yet, retrying (${_timeout}/${CLAMD_STARTUP_TIMEOUT}) ..."
        sleep 1
        _timeout="$((_timeout + 1))"
    done
    echo "socket found, clamd started."
fi

if [ "${CLAMAV_NO_FRESHCLAMD:-false}" != "true" ]; then
    echo "Starting FreshclamD to check for additional updates in the background"
    freshclam \
        --checks="${FRESHCLAM_CHECKS:-1}" \
        --daemon \
        --foreground \
        --stdout \
        --user="clamav" \
        &
fi
```

I'm also proposing changing freshclam.conf so that freshclam doesn't use a bunch of RAM load-testing the databases after download, and changing clamd.conf so that ClamD doesn't have two databases loaded in memory at the same time during a reload -- which I suspect is what is causing the zombie clamd-process issue. You will have to modify this area of the Dockerfile to add these options: https://github.com/Cisco-Talos/clamav/blob/main/Dockerfile#L56-L71

Something like this:

```sh
sed -e "s|^\(Example\)|\# \1|" \
    -e "s|.*\(PidFile\) .*|\1 /run/lock/clamd.pid|" \
    -e "s|.*\(LocalSocket\) .*|\1 /run/clamav/clamd.sock|" \
    -e "s|.*\(TCPSocket\) .*|\1 3310|" \
    -e "s|.*\(TCPAddr\) .*|#\1 0.0.0.0|" \
    -e "s|.*\(User\) .*|\1 clamav|" \
    -e "s|^\#\(LogFile\) .*|\1 /var/log/clamav/clamd.log|" \
    -e "s|^\#\(LogTime\).*|\1 yes|" \
    -e "s|^\#\(ConcurrentDatabaseReload\).*|\1 no|" \
    "/clamav/etc/clamav/clamd.conf.sample" > "/clamav/etc/clamav/clamd.conf" && \
sed -e "s|^\(Example\)|\# \1|" \
    -e "s|.*\(PidFile\) .*|\1 /run/lock/freshclam.pid|" \
    -e "s|.*\(DatabaseOwner\) .*|\1 clamav|" \
    -e "s|^\#\(UpdateLogFile\) .*|\1 /var/log/clamav/freshclam.log|" \
    -e "s|^\#\(NotifyClamd\).*|\1 /etc/clamav/clamd.conf|" \
    -e "s|^\#\(ScriptedUpdates\).*|\1 yes|" \
    -e "s|^\#\(TestDatabases\).*|\1 no|" \
    "/clamav/etc/clamav/freshclam.conf.sample" > "/clamav/etc/clamav/freshclam.conf" && \
```

What do you think?

---

I think that instead of executing a `tail` just to keep the container alive, clamd should run as the container's main foreground process. That way, if it crashes, the container will restart and will not become a zombie container. So the scenario will be like this:
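
(A minimal sketch of that idea, assuming clamd becomes PID 1 via `exec`; the freshclam invocation mirrors the script earlier in the thread:)

```sh
#!/bin/sh
# Sketch: update databases first, then make clamd the main process.
# If clamd dies, the container exits and the orchestrator restarts it.
freshclam --foreground --stdout --user="clamav"
exec clamd --foreground
```
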

---

@victorwedo I think I agree with you. It's still possible the freshclam daemon might die and no one would be the wiser, but that's the case right now anyway. So what you suggest seems like an improvement to me. Anyone have any thoughts on why we shouldn't do this? @oliv3r, any thoughts?

---

So I've had this container running for months, but without any memory restrictions on my containers :) I get, however, that that is a serious need for certain setups. First, I just want to re-state what @micahsnyder wrote.

By the way, instead of tail, we can also see if we can just start a process monitor, of course, right? "Is the PID of freshclamd still OK; is the PID of clamd still OK?" And exit if not? That's just some shell magic instead of a tail :) The tail just exists to keep the container running and is ugly in itself anyhow (see the sketch after this comment).

Lack of RAM will always be an issue, right? One option (at double the potential RAM cost, due to freshclam also needing a lot of RAM) is to launch 2 containers, one managing freshclamd and one managing clamd; this is supported via the CLAMAV_NO_FRESHCLAMD and CLAMAV_NO_CLAMD variables, to support both use cases. That is the main reason we have them.

I think @eht16 makes a sensible statement: the whole purpose of an init/tini is to do exactly this, isn't it? Reap processes. We could see if tini takes care of this for us.

As for "start freshclam first; wait for it to finish; then start clamd; and freshclamd again" -- that is really also just duct tape on the issue, isn't it? Why should clamd not function when the database is out of date? Btw, didn't we do this anyway if we don't have any database (as clamd can't function without it)? While I think it's good to make this more robust (making sure init catches the issues), I also think we should solve the problem at the core as well (avoid zombies to begin with, which, granted, is a longer-term goal of course).
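
(A hedged sketch of such a monitor loop; `clamd_pid` and `freshclam_pid` are assumed to have been captured with `$!` when the daemons were started in the background:)

```sh
# Sketch: replace the trailing `tail` with a loop that exits when either
# daemon is gone, so the container dies instead of lingering.
# clamd_pid and freshclam_pid are assumed captured via $! at startup.
# Caveat: a zombie still has a live PID, so kill -0 alone won't catch it;
# proper reaping (wait / tini) is still needed alongside this.
while true; do
    if ! kill -0 "${clamd_pid}" 2>/dev/null; then
        echo "clamd is no longer running, exiting"
        exit 1
    fi
    if ! kill -0 "${freshclam_pid}" 2>/dev/null; then
        echo "freshclam is no longer running, exiting"
        exit 1
    fi
    sleep 5
done
```

---
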
Hey folks,

In testing this image out for an application, I'm noticing that clamd just stops after some time, and `ps` lists it as a zombie. I noticed this when a container would randomly stop responding after some time. There was a failed update before the last error state we had; perhaps that has something to do with it. Perhaps there is an issue with not using `wait` in the init script? See the logs below:
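
(A minimal sketch of the `wait` idea, assuming the init script starts both daemons in the background; invocations mirror the script earlier in the thread:)

```sh
# Sketch: start the daemons in the background, then wait on clamd so the
# shell reaps it when it exits, instead of leaving a zombie behind.
clamd --foreground &
clamd_pid=$!

freshclam --daemon --foreground --stdout --user="clamav" &

# Blocks until clamd exits; its exit status becomes the container's status.
wait "${clamd_pid}"
```
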