
clamonacc hanging, causing high system load #245

Open
frank-fegert opened this issue Aug 10, 2021 · 8 comments

@frank-fegert
Contributor

frank-fegert commented Aug 10, 2021

Describe the bug

We're using ClamAV (clamd with clamonacc) on RHEL 7 and are experiencing similar issues as described in
#184
clamonacc seems to hang indefinitely, causing high system load (>>40 on an 8-CPU system), and crashes immediately if strace or gdb is attached to the process.
The system on which this happens most often is a Jenkins build server with a very large number of files and directories being generated dynamically during builds. The paths of the files and directories seem to contain special ASCII characters (e.g. whitespace and "@").

We've seen this issue:

  • since updating from v0.101.x
  • with the stock EPEL packages (https://src.fedoraproject.org/rpms/clamav/blob/epel7/f/clamav.spec) using TCP-based communication between clamd and clamonacc (due to the old curl v7.29.0 on RHEL)
  • with custom-built packages of v0.103.4 plus cherry-picked patches, dynamically linked against the old curl v7.29.0, using TCP-based communication between clamd and clamonacc
  • with custom-built packages of v0.103.4 plus cherry-picked patches, dynamically linked against an updated curl v7.77.0
  • with custom-built packages of v0.104.0 (based on d0d0a83), statically built against an updated curl v7.77.0 (as described in Statically link (new) libcurl into clamav for use on CentOS 7? #199), using "--fdpass"

How to reproduce the problem

An exact reproducer is currently not known. The issue seems to appear while creating a large number of files and directories with names containing special ASCII characters (e.g. whitespace and "@") under the paths in OnAccessIncludePath.
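For anyone trying to approximate the Jenkins workload described above, a minimal stress sketch might look like the following. This is a hypothetical approximation, not a confirmed reproducer; the directory and file names (containing spaces and "@") are illustrative, and the target directory would need to sit under a path listed in OnAccessIncludePath.

```shell
#!/bin/sh
# Hypothetical stress sketch: churn files whose names contain
# whitespace and "@", similar to the described Jenkins workspace.
# TARGET is a placeholder; point it at a directory monitored via
# OnAccessIncludePath to exercise clamonacc.
TARGET="${1:-/tmp/onacc-stress}"
mkdir -p "$TARGET"

i=0
while [ "$i" -lt 100 ]; do
    # Paths with spaces and "@", nested two levels deep.
    d="$TARGET/build @${i}/sub dir ${i}"
    mkdir -p "$d"
    printf 'TEST %s\n' "$i" > "$d/artifact @${i}.tmp"
    i=$((i + 1))
done

# Delete everything again to generate the matching removal events.
rm -rf "$TARGET"
```

Increasing the iteration count, or running several instances in parallel, would come closer to the "very large amount of files" the report describes.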

Checking configuration files in /etc

Config file: clamd.d/scan.conf
------------------------------
LogFile = "/var/log/clamd/clamd.scan.log"
LogTime = "yes"
LogSyslog = "yes"
LogVerbose = "yes"
LocalSocket = "/var/run/clamd.scan/clamd.sock"
StreamMaxLength = "536870912"
MaxThreads = "50"
MaxQueue = "200"
MaxFileSize = "536870912"
OnAccessIncludePath = "<Jenkins Path>"
OnAccessExcludeUname = "clamupdate"
OnAccessPrevention = "yes"
OnAccessCurlTimeout = "20000"
OnAccessMaxThreads = "10"
OnAccessRetryAttempts = "1"

Config file: freshclam.conf
---------------------------
LogSyslog = "yes"
PrivateMirror = "<PrivateMirror>"
OnUpdateExecute = "EXIT_1"

mail/clamav-milter.conf not found

Software settings
-----------------
Version: 0.104.0-rc
Optional features supported: MEMPOOL AUTOIT_EA06 BZIP2 LIBXML2 PCRE2 ICONV JSON

Database information
--------------------
Database directory: /var/lib/clamav
main.cvd: version 61, sigs: 6607162, built on Thu Jul 15 04:39:10 2021
bytecode.cvd: version 333, sigs: 92, built on Mon Mar  8 16:21:51 2021
daily.cld: version 26258, sigs: 1967166, built on Mon Aug  9 10:18:46 2021
Total number of signatures: 8574420

Platform information
--------------------
uname: Linux 3.10.0-1160.11.1.el7.x86_64 #1 SMP Mon Nov 30 13:05:31 EST 2020 x86_64
OS: Linux, ARCH: x86_64, CPU: x86_64
Full OS version: "Red Hat Enterprise Linux Server release 7.9 (Maipo)"
zlib version: 1.2.7 (1.2.7), compile flags: a9
platform id: 0x0a2182820800000000040805

Build information
-----------------
GNU C: 4.8.5 20150623 (Red Hat 4.8.5-44) (4.8.5)
sizeof(void*) = 8
Engine flevel: 130, dconf: 130

ClamOnAccess process limits:

# cat 91299.limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             31766                31766                processes
Max open files            65000                65000                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       31766                31766                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

ClamOnAccess process status:

# cat 91299.status
Name:   clamonacc
Umask:  0022
State:  S (sleeping)
Tgid:   91299
Ngid:   0
Pid:    91299
PPid:   1
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 128
Groups:
VmPeak:  1019996 kB
VmSize:  1017940 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:     95436 kB
VmRSS:     93532 kB
RssAnon:           92988 kB
RssFile:             544 kB
RssShmem:              0 kB
VmData:   950768 kB
VmStk:       132 kB
VmExe:       672 kB
VmLib:     12488 kB
VmPTE:       420 kB
VmSwap:        0 kB
Threads:        12
SigQ:   0/31766
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: fffffffe7ffbb035
SigIgn: 0000000000001000
SigCgt: 0000000180004e02
CapInh: 0000000000000000
CapPrm: 0000001fffffffff
CapEff: 0000001fffffffff
CapBnd: 0000001fffffffff
CapAmb: 0000000000000000
NoNewPrivs:     0
Seccomp:        0
Speculation_Store_Bypass:       thread vulnerable
Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
Cpus_allowed_list:      0-127
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        1875819
nonvoluntary_ctxt_switches:     267

ClamOnAccess core dump gdb info:

# gdb /usr/sbin/clamonacc clamonacc.core.91299
[...]
Reading symbols from /usr/sbin/clamonacc...Reading symbols from /usr/lib/debug/usr/sbin/clamonacc.debug...done.
done.
[New LWP 91307]
[New LWP 91309]
[New LWP 91310]
[New LWP 91311]
[New LWP 91312]
[New LWP 91313]
[New LWP 91314]
[New LWP 91315]
[New LWP 91316]
[New LWP 91317]
[New LWP 91318]
[New LWP 91299]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/clamonacc'.
#0  0x00007fd4d0770a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 glibc-2.17-324.el7_9.x86_64 json-c-0.11-4.el7_0.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-50.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libselinux-2.5-15.el7.x86_64 libssh2-1.8.0-4.el7.x86_64 libxml2-2.9.1-6.el7.5.x86_64 openssl-libs-1.0.2k-21.el7_9.x86_64 pcre-8.32-17.el7.x86_64 pcre2-10.23-2.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-19.el7_9.x86_64

ClamOnAccess gdb thread info:

(gdb) info threads
  Id   Target Id         Frame
  12   Thread 0x7fd4d1389840 (LWP 91299) 0x00007fd4d0a76a33 in select () from /lib64/libc.so.6
  11   Thread 0x7fd4beffd700 (LWP 91318) 0x00007fd4d0770a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  10   Thread 0x7fd4bf7fe700 (LWP 91317) 0x00007fd4d0770a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  9    Thread 0x7fd4bffff700 (LWP 91316) 0x00007fd4d0770a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  8    Thread 0x7fd4c4ff9700 (LWP 91315) 0x00007fd4d0770a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  7    Thread 0x7fd4c57fa700 (LWP 91314) 0x00007fd4d0770a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  6    Thread 0x7fd4c5ffb700 (LWP 91313) 0x00007fd4d0770a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  5    Thread 0x7fd4c67fc700 (LWP 91312) 0x00007fd4d0770a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  4    Thread 0x7fd4c6ffd700 (LWP 91311) 0x00007fd4d0770a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3    Thread 0x7fd4c77fe700 (LWP 91310) 0x00007fd4d0770a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2    Thread 0x7fd4c7fff700 (LWP 91309) 0x00007fd4d0770a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
* 1    Thread 0x7fd4cd2d6700 (LWP 91307) 0x00007fd4d0770a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

ClamOnAccess gdb backtrace:

(gdb) bt
#0  0x00007fd4d0770a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00005590a7a9f0e4 in onas_consume_event (thpool=0x7fd4c0000900) at /usr/src/debug/clamav-0.104.0/clamonacc/scan/onas_queue.c:193
#2  onas_scan_queue_th (arg=<optimized out>) at /usr/src/debug/clamav-0.104.0/clamonacc/scan/onas_queue.c:172
#3  0x00007fd4d076cea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fd4d0a7f9fd in clone () from /lib64/libc.so.6

(gdb) bt full
#0  0x00007fd4d0770a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00005590a7a9f0e4 in onas_consume_event (thpool=0x7fd4c0000900) at /usr/src/debug/clamav-0.104.0/clamonacc/scan/onas_queue.c:193
        popped_node = <optimized out>
#2  onas_scan_queue_th (arg=<optimized out>) at /usr/src/debug/clamav-0.104.0/clamonacc/scan/onas_queue.c:172
        __clframe = {__cancel_routine = <optimized out>, __cancel_arg = 0x0, __do_it = 1, __cancel_type = <optimized out>}
        ctx = <optimized out>
        sigset = {__val = {18446744067267081013, 18446744073709551615 <repeats 15 times>}}
        thpool = 0x7fd4c0000900

ClamOnAccess gdb variable info:

(gdb) p onas_scan_queue_empty_cond
$1 = {__data = {__lock = 0, __futex = 3849157, __total_seq = 1924579, __wakeup_seq = 1924578, __woken_seq = 1924578, __mutex = 0x5590a7d351e0 <onas_queue_lock>, __nwaiters = 2, __broadcast_seq = 0},
  __size = "\000\000\000\000Ż:\000\343]\035\000\000\000\000\000\342]\035\000\000\000\000\000\342]\035\000\000\000\000\000\340Qӧ\220U\000\000\002\000\000\000\000\000\000",
  __align = 16532003432169472}
(gdb) p onas_queue_lock
$2 = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
  __size = '\000' <repeats 12 times>, "\001", '\000' <repeats 26 times>, __align = 0}

Attachments

Core dump is available if helpful.

@goshansp
Contributor

@frank-fegert are you perhaps able to reproduce this using rpmbuild --rebuild <random>.srpm, like in https://gitlab.com/goshansp/ansible-role-clamav/-/blob/wip_load_test_etc/tests/workload_development.yml? I think the high load is a feature. The last instability I've seen on dev/104 was MaxThreads not having an effect when spinning threads up; in most cases only 1 or 2 of them were causing load on just one core, as seen with top -H -p <clamonacc_pid>. Are your threads spinning up properly? Thank you for the detailed report, I've learned a lot from it and appreciate your effort.

@frank-fegert
Contributor Author

@goshansp thanks for the fast reply!
If i'm doing a synthetic load test on the monitored filesystem/directory with e.g.:

time for FL in $(seq -s ' ' 1 99999); do echo "TEST $FL TEST $FL" >> <ClamOnAcc monitored directory>/test_${FL}.tst; done

real    2m53.938s
user    0m3.537s
sys     0m5.541s

i can see the CPU utilization:

  • on the individual (13) clamonacc threads being 0.0% - 1.7% with a slightly skewed distribution towards one thread.
  • on the (3) clamd threads being >70% for one and <<5% for the other two threads.

So yes, i can confirm that there is a skewed distribution in the CPU utilization, but more with Clamd than on the side of ClamOnAcc 😉

Unfortunately, even with the above synthetic load test i wasn't able to trigger the behaviour of clamonacc hanging on "onas_consume_event" ☹️

@m-sola
Contributor

m-sola commented Aug 12, 2021

@frank-fegert

Have you tested against the latest 104 or 103.4 branches? We put in a fix very recently (i.e. post rc build) that very likely addresses this issue given the backtrace provided and behaviour seen.

Edit: nevermind, I see now you've built off of that commit. I'll take a closer look at this

@frank-fegert
Contributor Author

@m-sola thanks for the fast reply!

As i've written above in the last bullet point, i've tested d0d0a83 from the main branch. This should have the fix for #184 and should be the same as b3315ec which i guess you're referring to?

The backtraces shown are from a recent case of the hang with ClamAV built from d0d0a83. I've still got core dumps from the previous tests and versions (see the other bullet points above) if this helps.

@frank-fegert
Contributor Author

On the last three occasions we've experienced the hanging behaviour of clamonacc described above, we managed to capture the following messages in the clamonacc logfile:

ClamInotif: MOVED_TO - adding <subdirectory> to <directory> with wd:121315
Clamonacc: onas_clamonacc_exit(), signal 11
ERROR: Clamonacc: clamonacc has experienced a fatal error, if you continue to see this error, please run clamonacc with --verbose and report the issue and crash report to the developers
Clamonacc: attempting to stop ddd thread ...
ClamInotif: onas_ddd_exit()
ClamInotif: stopped

clamonacc is already running with the --verbose switch. <subdirectory> and <directory> were different on all three occasions.

Unfortunately i'm not able to pinpoint the thread or the code location where the SIGSEGV actually happens ☹️

Any ideas on how to narrow this issue down further?

@frank-fegert
Contributor Author

Hello all,

so i've finally found some time to brush up on my C / pthreads skills and sink my teeth into this issue (vacation, november rain and lots of 🍰 really do help with the debugging process 😉).

I think - but could very well be misled - i was able to isolate the underlying root cause of this issue, but besides a quick'n'dirty workaround i'm not sure what the best way to properly fix it would be.

Basically the issue revolves around the ClamInotif thread of clamonacc not handling the special inotify kernel events, in this case specifically the IN_Q_OVERFLOW event. The watch descriptor returned by an IN_Q_OVERFLOW event looks like this example debug output:

ClamInotif: inotify event wd:-1, mask:16384, cookie:0, length:0, child:(null)

As a result the strlen(path) in https://github.com/Cisco-Talos/clamav/blob/main/clamonacc/inotif/inotif.c#L640 causes a SIGSEGV. This by itself could easily be fixed. There is, on the other hand, the more general issue that after a queue overflow event the application's caches (here ddd_ht and wdlt) should be considered invalid and rebuilt (see https://man7.org/linux/man-pages/man7/inotify.7.html, section "Limitations and caveats", second-to-last paragraph), because inotify events were lost. This seems a bit more tricky to me.

I was able to create a reproducer shell script (attached file break_clamonacc_ddd_thread.txt) that - at least on my system - reliably triggers the issue. But since the issue appears to be a tad load-dependent, your mileage may vary.

Along the way of debugging this issue i've come across some other things i found odd, which are currently addressed as local patches. Are pull requests preferred for discussing them, or should i drop them in here?

Thanks & best regards,

Frank

@frank-fegert
Contributor Author

#!/bin/bash

#DRY_RUN="echo"
BASEDIR=""
DUMMY_PATH="somearbitrary_string_just_to_make_the_pathname_longer"
DUMMY_FILE="somearbitrary_string_just_to_make_the_filename_longer"

if [[ $# -ne 1 ]]; then
  echo "ERROR: No base directory monitored by clamonacc given."
  exit 1
else
  BASEDIR=$1
  if [[ ! -d "${BASEDIR}" ]]; then
    echo "ERROR: Provided path monitored by clamonacc \"${BASEDIR}\" not a directory."
    exit 1
  fi
fi

echo ${BASEDIR}/clamonacc_inotify_wached_directory_{0,1,2,3,4,5,6,7,8,9}/clamonacc_inotify_wached_sub_directory_{0,1,2,3,4,5,6,7,8,9}/${DUMMY_PATH}/${DUMMY_PATH}/${DUMMY_PATH}/${DUMMY_PATH}/${DUMMY_PATH}/${DUMMY_PATH}_{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}{0,1,2,3,4,5,6,7,8,9} | xargs -P 0 ${DRY_RUN} mkdir -p
${DRY_RUN} rm -rf ${BASEDIR}/clamonacc_inotify_wached_directory_*
exit 0

@frank-fegert
Contributor Author

Added a workaround to PR #386 for another SIGSEGV source in the clamonacc-ddd thread, which was observed during further analysis.
