clamonacc fatal error #184

Closed

goshansp opened this issue Jun 30, 2021 · 8 comments
@goshansp (Contributor) commented Jun 30, 2021

issue

clamonacc dev/104 crashes with the ERROR below when a folder such as the clamav git repo is copied and we approach about ~10000 open files. The issue occurs on all kernel versions. The default ulimit=1024 needs to be raised for this to occur.

ERROR: Clamonacc: clamonacc has experienced a fatal error, if you continue to see this error, please run clamonacc with --verbose and report the issue and crash report to the developers

reproduction steps

# raise root ulimit
[vagrant@rhel8 ~]$ sudo vim /etc/security/limits.conf # add the following two lines
root    soft    nofile  100000
root    hard    nofile  100000

# verify root ulimit
[vagrant@centos7 git]$ sudo su -
[root@centos7 ~]# ulimit -n
100000


# starting clamonacc
[vagrant@rhel8 ~]$ sudo git/clamav-devel/build/clamonacc/clamonacc -F --config-file=/etc/clamd.d/clamd.conf --stream --verbose

# wreaking havoc:
[vagrant@rhel8 ~]$ git clone https://github.com/Cisco-Talos/clamav.git
[vagrant@rhel8 ~]$ cp clamav-devel target_git_folder -r
[vagrant@rhel8 ~]$ cat /proc/sys/fs/file-nr
# repeat the above folder copy until clamonacc crashes with signal 11, at around 11k
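
To avoid repeating the copy by hand, a small loop can drive the reproduction while printing the system-wide open-file count after each pass (a sketch; the target folder names are made up for illustration):

# sketch: duplicate the repo repeatedly and print /proc/sys/fs/file-nr each pass
for i in $(seq 1 20); do
    cp -r clamav-devel target_git_folder_$i
    cat /proc/sys/fs/file-nr
done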

conf

TCPSocket 3310
TCPAddr 127.0.0.1
TemporaryDirectory /tmp/clamav
OnAccessExcludeUname clamscan
OnAccessIncludePath /home
OnAccessIncludePath /usr
OnAccessIncludePath /etc
DatabaseDirectory /var/lib/clamav

further observations

log

ClamFanotif: attempting to feed consumer queue
Clamonacc: onas_clamonacc_exit(), signal 11
ERROR: Clamonacc: clamonacc has experienced a fatal error, if you continue to see this error, please run clamonacc with --verbose and report the issue and crash report to the developers            
Clamonacc: attempting to stop ddd thread ...
ClamInotif: onas_ddd_exit()
ClamInotif: stopped
Clamonacc: attempting to stop event consumer thread ...
ClamScanQueue: onas_scan_queue_exit()

Please let me know if there's anything for me to test. Any advice appreciated. Thanks.

@goshansp (Contributor, Author) commented:

Confirmed as well on dev/0.103.3 and rel/0.103. How do my binaries (fatal error) differ from the packaged RPMs 0.103.2 (working)? What is different in the build of https://src.fedoraproject.org/rpms/clamav/blob/rawhide/f/clamav.spec ?

@micahsnyder (Contributor) commented:

I don't see anything that would explain it. 🤷‍♂️ 🙁

@goshansp (Contributor, Author) commented Jul 1, 2021

Still occurs without OnAccessExtraScanning yes, so it may rather originate from the fanotify side ... the issue reproduces reliably on a reference RHEL 8 and on Fedora 34 (kernel 5.11) as well.

@goshansp (Contributor, Author) commented Jul 5, 2021

cat /proc/sys/fs/file-nr will rise to ~11000, at which point we get Clamonacc: onas_clamonacc_exit(), signal 11
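
For instance, a generic way to watch that count climb in real time while the copies run:

# refresh the system-wide open-file counters every second
watch -n 1 cat /proc/sys/fs/file-nr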

m-sola self-assigned this Jul 6, 2021
@goshansp (Contributor, Author) commented:

After switching to a custom fork provided by @m-sola and to CentOS/fdpass/unix socket, I cannot get clamonacc to crash anymore. Unless I am missing something, we should implement #199, which would allow us to move from stream/TCP to fdpass/unix socket and might allow us to close this issue. Unfortunately, our POC has a gate on 27.07 which we aim to pass.

@goshansp (Contributor, Author) commented:

#199 has been POC'd and its static linkage seems to provide a robust workaround for the moment. A sustainable fix for the --stream fd leak would be much appreciated.
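
For reference, a minimal sketch of the unix-socket/fdpass setup referred to above (the socket path is an assumption, not taken from this thread):

# clamd.conf: serve scan requests over a local unix socket instead of TCP
LocalSocket /run/clamd.scan/clamd.sock

# run clamonacc against that socket, passing file descriptors instead of streaming
sudo clamonacc -F --config-file=/etc/clamd.d/clamd.conf --fdpass --verbose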

m-sola pushed a commit to m-sola/clamav that referenced this issue Jul 27, 2021
This fixes a fatal issue that would occur when unable to queue events due to
clamonacc improperly using all available fds.

It also fixes the core fd socket leak issue at the heart of the segfault by
properly cleaning up after a failed curl connection.
@m-sola (Contributor) commented Jul 27, 2021

FYI, this is fixed with #227

You can test by following the instructions listed above and monitoring the proc fd dir for clamonacc while it's running.

e.g.

ps -C clamonacc -o pid=

ls -l /proc/insert_result_of_ps_command_here/fd

Run the ls command after the clamonacc log shows failed connections to clamd. Previously, clamonacc would leak fds on failed connection and you would be able to see the sockets by monitoring proc. With the fix, you can see via proc that these fds are correctly cleaned up.
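
For convenience, the two commands can be combined into a single fd count to watch over time (a sketch; assumes exactly one clamonacc process is running):

# count the open fds of the running clamonacc process
sudo ls /proc/$(ps -C clamonacc -o pid= | tr -d ' ')/fd | wc -l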

m-sola pushed a commit to m-sola/clamav that referenced this issue Jul 30, 2021
This fixes a fatal issue that would occur when unable to queue events due to
clamonacc improperly using all available fds.

It also fixes the core fd socket leak issue at the heart of the segfault by
properly cleaning up after a failed curl connection.

Lastly, worst case recovery code now allows more time for consumer queue
to catchup. It accomplishes this by increasing wait time and adding
retry logic.

More info: Cisco-Talos#184
micahsnyder pushed a commit that referenced this issue Aug 6, 2021

micahsnyder pushed a commit that referenced this issue Aug 6, 2021
m-sola pushed a commit to m-sola/clamav that referenced this issue Aug 9, 2021

micahsnyder pushed a commit that referenced this issue Aug 9, 2021
@micahsnyder (Contributor) commented:

Here's the (now merged) PR for 0.103.4: #244

micahsnyder pushed a commit that referenced this issue Nov 3, 2021