IP persistence storage seems to not clean up #574

Closed
rcbarnett opened this Issue Oct 17, 2013 · 23 comments

10 participants
Contributor

rcbarnett commented Oct 17, 2013

MODSEC-426: It seems that mod_security is not cleaning up the persistent storage for the IP collection I am using.

Here is a simple test ruleset I am using (nothing else):

SecRule IP:dos_block "@eq 1"
"phase:2,block,msg:'IP address blocked due to high connection rate.',
id:2003,status:403,severity:'CRITICAL',tag:'DoS'"
SecRule SCRIPT_FILENAME "@rx /www/content/prod/dl/dl.php"
"phase:5,chain,t:none,nolog,pass,id:2001,severity:'INFO',tag:'DoS',setvar:IP.dos_counter=+1,expirevar:IP.dos_counter=60"
SecRule IP:dos_counter "@gt 100"
"t:none,setvar:IP.dos_block,setvar:!IP.dos_counter,expirevar:IP.dos_block=60"
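As a reading aid, here is a minimal Python sketch (not ModSecurity code) of the rate-limit logic the rules above implement: each matching request bumps a per-IP counter that expires after 60 s, and once the counter exceeds 100 the IP is blocked for 60 s. The class and method names are invented for illustration.

```python
import time

class IpDosTracker:
    """Illustrative simulation of rules 2001/2003 above, not ModSecurity internals."""
    COUNTER_TTL = 60   # expirevar:IP.dos_counter=60
    BLOCK_TTL = 60     # expirevar:IP.dos_block=60
    THRESHOLD = 100    # @gt 100

    def __init__(self, clock=time.time):
        self.clock = clock
        self.counters = {}  # ip -> (count, expires_at)
        self.blocks = {}    # ip -> expires_at

    def is_blocked(self, ip):
        """Phase-2 check, corresponding to rule 2003 (IP:dos_block @eq 1)."""
        expires_at = self.blocks.get(ip)
        if expires_at is not None and expires_at > self.clock():
            return True
        self.blocks.pop(ip, None)  # entry expired: forget it
        return False

    def record_request(self, ip):
        """Phase-5 accounting, corresponding to the chained rule 2001."""
        now = self.clock()
        count, expires_at = self.counters.get(ip, (0, 0))
        if expires_at <= now:      # counter expired, start over
            count = 0
        count += 1
        if count > self.THRESHOLD:
            self.blocks[ip] = now + self.BLOCK_TTL  # setvar:IP.dos_block
            self.counters.pop(ip, None)             # setvar:!IP.dos_counter
        else:
            self.counters[ip] = (count, now + self.COUNTER_TTL)
```

Note that in this simulation expired entries are dropped on access; the issue below is precisely that ModSecurity's on-disk collection does not appear to do the equivalent cleanup.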

The ruleset is working fine. But I am noticing that the /var/tmp/ip.pag file is constantly growing; it is currently at >1 GiB.
When I run "strings" on that file, I see entries like this:

__expire_KEY
1344192686
794.224.68.113_64d9d88c927c58359a3649f30a00e95070bbb8c0
TIMEOUT
3600
__key
794.224.68.113_64d9d88c927c58359a3649f30a00e95070bbb8c0
__name
CREATE_TIME
1344189037
UPDATE_COUNTER
dos_counter
__expire_dos_counter
1344189097
LAST_UPDATE_TIME
1344189086
94.224.68.113_64d9d88c927c58359a3649f30a00e95070bbb8c0
__expire_KEY
1343169406
868.167.225.250_77150c41806fda817314ba9c40e040c598830d5d
TIMEOUT
3600
__key
868.167.225.250_77150c41806fda817314ba9c40e040c598830d5d
__name
CREATE_TIME
1343165769
UPDATE_COUNTER
dos_counter
__expire_dos_counter
1343165829
LAST_UPDATE_TIME
1343165806
68.167.225.250_77150c41806fda817314ba9c40e040c598830dIR
__expire_KEY
1344005533
786.132.126.93_aa739e3aaaa1fbfc8667bb26120e9930df881a82
TIMEOUT
3600
__key
786.132.126.93_aa739e3aaaa1fbfc8667bb26120e9930df881a82
__name
[...]

Now when I convert the UNIX timestamps to UTC, they show dates that expired weeks ago.
I would assume that once an entry has expired, mod_security's garbage collection would
remove it from the ip.pag file to keep it small. Or am I wrong?
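The UTC conversion described above takes only a few lines of Python; the timestamp used here is the first `__expire_KEY` value from the strings dump (illustrative only, the helper name is invented):

```python
from datetime import datetime, timezone

def to_utc(ts: int) -> str:
    """Render a UNIX timestamp from the sdbm strings dump as UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")

# First __expire_KEY value in the dump above: a date in August 2012,
# long past by the time this issue was filed.
print(to_utc(1344192686))
```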

Any assistance would be highly appreciated.

@ghost ghost assigned zimmerle Oct 17, 2013

Contributor

rcbarnett commented Oct 17, 2013

Original reporter: wneessen

Contributor

rcbarnett commented Oct 17, 2013

wneessen: (This is actually a clone of MODSEC-325)

Hi Breno,

I am sorry that I have to re-open this old ticket (or actually clone it). After reading your response I didn't investigate the issue further and simply disabled mod_security on the specific host, which resolved our problems for the time being.
Now, almost a year later, I had to re-enable mod_security with the same ruleset, hoping that the problem was gone given the couple of version upgrades in between... unfortunately it wasn't.

Here is a short timeline of what happened and what I was able to capture:

Re-enabled mod_security with the same ruleset on Tuesday. The rules worked as expected.
On Wednesday I already noticed growth of the /var/tmp/ip.pag file.
Thursday night, Apache became unresponsive for at least 10 minutes (we switched to the backup server
so as not to lose traffic). The server-status page (which did load, very slowly, after a couple of minutes) showed
tons of "L" processes, while the mod_security debug log was filled with lots of these:

[05/Sep/2013:21:47:09 +0000] [xxxxxxx.cleverbridge.com/sid#80171c530][rid#802dbe0a0][/XXXXXXXXX/xxxxxxx-xxxxxxx-xxxxxxx-xxxxxxx/xxxxxxx.pdf][1] collections_remove_stale: Failed deleting collection (name "ip", key "80.XXX.XXX.XXX_6e1c38356769095ad76267f9edb43ddae2af1491"): Internal error
[05/Sep/2013:21:47:09 +0000] [xxxxxxx.cleverbridge.com/sid#80171c530][rid#802dac0a0][/XXXXXXXXX/xxxxxxx-xxxxxxx-xxxxxxx-xxxxxxx/xxxxxxx.msi][1] collections_remove_stale: Failed deleting collection (name "ip", key "80.XXX.XXX.XXX_6e1c38356769095ad76267f9edb43ddae2af1491"): Internal error
[05/Sep/2013:21:50:47 +0000] [xxxxxxx.cleverbridge.com/sid#801739a50][rid#802d760a0][/server-status][1] collections_remove_stale: Failed deleting collection (name "ip", key "85.XXX.XXX.XXX_6e1c38356769095ad76267f9edb43ddae2af1491"): Internal error
[05/Sep/2013:21:50:47 +0000] [xxxxxxx.cleverbridge.com/sid#801739a50][rid#802d760a0][/server-status][1] collections_remove_stale: Failed deleting collection (name "ip", key "85.XXX.XXX.XXX_6e1c38356769095ad76267f9edb43ddae2af1491"): Internal error

The /var/tmp/ip.pag file grew to a size of 7GiB:

% ls -l /var/tmp/ip.pag :-/
-rw-r----- 1 www wheel 7560283136 Sep 6 12:05 /var/tmp/ip.pag

% du -A /var/tmp/ip.pag (FreeBSD userland "du" - values in kbytes)
7383089 /var/tmp/ip.pag

% gdu -b /var/tmp/ip.pag (GNU "du" on FreeBSD)
7560283136 /var/tmp/ip.pag

The SecCollectionTimeout was set to "600".

Although this is a download server, it is not hit so hard that it should accumulate a 7 GiB collection of IP/User-Agent combinations,
given that the values should be overwritten after 10 minutes. Do you have any idea what is going on here? As said, the problem started
again after I enabled the (latest) mod_security version on this server. It had been running without this problem for almost a year with
mod_security disabled.

Thanks for your help.
Winfried

Hello,

We had the same unfortunate experience.
After two and a half months, the ip.pag file reached a size of 25 GB (according to both "du -b" and "ls -l").
Our webserver became very unstable, with the load rapidly spiking from 5 to 60 and then slowly going back down (all the websites were very slow).

The solution was to erase ip.pag and let modsec create a new one.

In apache error.log file, we can see lots of:

[error] [client xxx.xxx.71.148] ModSecurity: Failed deleting collection (name "ip", key "83.43.238.96_befe6113014112e0700b42d9324b0f65e7f4bd6d"): Internal error [hostname "www.yyyyyyyy.org"] [uri "/zzzzzzzzzzzzzzzz"] [unique_id "UyXZ0k9jyzIAADnhARwAAAA-"]

I added in modsecurity.conf the config:

SecCollectionTimeout 600

but we still have the same errors.
I don't think SecCollectionTimeout was the problem anyway, as the reference guide says it defaults to 3600 when the directive is not set.

Server: Debian Wheezy
Apache: 2.2.22-13+deb7u1
ModSecurity:
libapache2-modsecurity 2.6.6-6+deb7u1
modsecurity-crs 2.2.5-2

Is there any way we can help debug that problem?
Our ip.pag file already grows beyond a reasonable size in less than a day.

The only info I found about it: "Looks like this error message is from a apr function." from Breno (who seems to be the one in charge of finding the bug) (http://marc.info/?l=mod-security-users&m=136683562925114&w=2)

The keys found in the error messages are NOT in the ip.pag file.

[error] [client xxx.xxx.71.148] ModSecurity: Failed deleting collection (name "ip", key "83.43.238.96_befe6113014112e0700b42d9324b0f65e7f4bd6d"): Internal error [hostname "www.yyyyyyyy.org"] [uri "/zzzzzzzzzzzzzzzz"] [unique_id "UyXZ0k9jyzIAADnhARwAAAA-"]

htdbm -l -Tsdbm /tmp/ip 2>&1 | grep "83.43.238.96_befe6113014112e0700b42d9324b0f65e7f4bd6d"

(The htdbm lookup above returns nothing.) So the problem doesn't seem to be that modsec is unable to delete existing keys.

After a few checks, we can see that the number of keys in ip.pag file goes up and down.
So modsec is managing correctly the db.

From apache2/persist_dbm.c (line 218):

    rc = apr_sdbm_delete(dbm, key);
    if (rc != APR_SUCCESS) {
        msr_log(msr, 1, "collection_retrieve_ex: Failed deleting collection (name \"%s\", "
            "key \"%s\"): %s", log_escape(msr->mp, col_name),
            log_escape_ex(msr->mp, col_key, col_key_len), get_apr_error(msr->mp, rc));
        msr->msc_sdbm_delete_error = 1;
        goto cleanup;
    }

And in apr-util-1.5.3, in dbm/sdbm/sdbm.c (line 230),

    if ((status = getpage(db, exhash(key))) == APR_SUCCESS) {
        if (!delpair(db->pagbuf, key))
            /* ### should we define some APRUTIL codes? */
            status = APR_EGENERAL;
        else
            status = write_page(db, db->pagbuf, db->pagbno);
    }

I'd say that the status code received in modsec comes from the call to getpage returning something other than APR_SUCCESS, as the key doesn't exist in the db.

We changed our modsec local exceptions to not log anymore (that represents a lot of requests) and set the SecCollectionTimeout directive to 600 (it wasn't set at all before, but the manual says the default value is 3600), and since then the ip.pag file has stayed at a reasonable size (110 MB for now).

Maybe you should just add more granularity to your tests and error/warning messages.

Owner

zimmerle commented Mar 17, 2014

Hi @quenenni, thanks.

Are you interested in providing your solution in the form of a pull request?

https://help.github.com/articles/using-pull-requests

Owner

zimmerle commented Mar 24, 2014

We have made a utility that may be useful for those who are using persistent collections in SDBM files. It allows users to manipulate the SDBM files used by ModSecurity, including shrinking them by removing expired keys or malformed content.

The utility is available here:
https://github.com/SpiderLabs/modsec-sdbm-util

@quenenni, can you check if you still have malformed content by using this utility?

Sorry for the late answer.

I did some more testing by playing with the "SecCollectionTimeout" directive and by reducing the logging ('nolog' for lots of rules I did).

I can already say:
With SecCollectionTimeout 600 and lots of 'nolog', my ip.pag file grew to 430 MB (in 6 days)
With SecCollectionTimeout 3600 and the same 'nolog', my ip.pag file grew to 922 MB (just since yesterday)

I'm going to keep this config a few days more, and then I'll test with the "SecCollectionTimeout" directive removed to see whether the default really is 3600.

@zimmerle I'll gladly test your utility, but can you explain how to use it? I couldn't figure it out myself.

@zimmerle Usually I like to offer solutions, but in this case I'd prefer not to, as I don't develop in C and I'm not sure I'd be able to do something good. Sorry.
But I'll help you as much as I can with debugging info.

Xavior82 commented Oct 8, 2014

First off, thanks to all that contributed to this thread, it helped put me on the right track.

Just thought I would add a note here, considering I have had the exact same issue as quenenni. Server load would spike from 2 to 20 in a seemingly random manner, and each time it appeared to be an httpd process that was causing the load spike. The disk IO was also excruciatingly slow throughout the day.

After a lot of debugging, we narrowed down the issue to mod_security (oddly enough, and largely still unexplained, the same mod_security rules are deployed on dozens of servers but only a few were exhibiting issues). When we disabled the rules, the load issues stopped. When we reactivated, the load problem returned.

Running lsof -p on the Apache process eventually led me to our now-infamous tmp/ip.pag storage file. That file alone had reached 16GB (on a partition that was otherwise limited to 8GB, so I'm not sure I understand how that could even work).

Following quenenni's lead I emptied the file and the performance improvement was instant. The load dropped entirely.

I added the following parameter to my config:
SecCollectionTimeout 600

And restarted Apache.

So far after 30 minutes or so the ip.pag file has grown to 117MB, but performance is still optimal.

So my question is: does setting the SecCollectionTimeout properly limit the file size in the long run?

FYI I am running on CloudLinux 6 with Apache 2.4

Owner

zimmerle commented Oct 8, 2014

Hi @Xavior82,

That is a difficult question to answer, because it depends on the load of your server. Usually reducing this timeout is enough to keep collections at a "feasible" size. You can list the contents of "ip.pag", or even perform an out-of-band "shrink", using modsec-sdbm-util.

Is the memcache-based replacement ready yet? The current implementation seems broken.

Owner

zimmerle commented Oct 28, 2014

Hi @tomchiverton,

there is this branch:
https://github.com/SpiderLabs/ModSecurity/tree/memcache_collections

It is not ready for production yet; some issues were reported, and so far there has been no time to fix them. Have you tried this `memcache_collections' branch already?

No, I was just wondering how well it had been tested, since the current storage system is still causing issues even at our moderate scale (~10 req/sec) and, from the looks of this bug, has been doing so for others for a year.
We're sticking-plastering over it with a shorter collection timeout, and will probably remove the file with a weekly cron job just in case.
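A sketch of that cron-based mitigation (the schedule, install path, and collection path are all placeholders; `-k` is the modsec-sdbm-util flag shown later in this thread for dropping expired entries, and a later comment mentions running it as often as every 15 minutes):

```shell
# /etc/cron.d/modsec-shrink -- illustrative config fragment, adjust paths to your setup.
# Weekly out-of-band shrink of the ip collection with modsec-sdbm-util:
0 3 * * 0  root  /usr/local/bin/modsec-sdbm-util -k /var/tmp/ip.pag
```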

djprmf commented Apr 27, 2016

Is this still being worked on?
It's been almost 3 years now...

Contributor

marcstern commented Apr 27, 2016

Is this specific to Apache?
Is there any plan to replace this mechanism by another one not using APR?
If the answer to the second question is "no", we can't leave this bug open forever! It gives a very bad image of the product.

Does anybody have an idea about the cause of this bug? Is it so complex to fix?

If it's only a warning that we can ignore, shouldn't we disable it (or log it only once per day/week/month with the number of failures)?

Owner

zimmerle commented Apr 27, 2016

Hi @djprmf,

The back-end used to save the collections has some limitations that are beyond the way ModSecurity uses it. As an alternative to this sdbm back-end we implemented the memcache experiment, which did not become very popular.

Other alternatives were also implemented, for example modsec-sdbm-util:

https://github.com/SpiderLabs/modsec-sdbm-util

This utility can be used to list and clean up/shrink the sdbm file out-of-band. To work around this particular "issue" you may want to reduce the collection timeout:

SecCollectionTimeout - https://github.com/SpiderLabs/ModSecurity/wiki/Reference-Manual#seccollectiontimeout

And/or, generally, use persistent collections more sparingly. Depending on how heavily the persistent-collection functionality is used, even if cleanup were working, the amount of I/O on your server would be very high due to the way it was implemented.

In ModSecurity version 3 (under development) we tackle persistent collections with a different approach: we let the user select the proper back-end for the infrastructure that he/she wants to support. That allows us to support a larger range of users, from a simple setup (with fewer requests per second, let's say) to a high-volume ISP (with a very high number of requests).

I am tagging this issue to be fixed as part of the release of ModSecurity version 3.

@marcstern The problem does not seem to be an implementation problem, but rather the choice of back-end. Yes, we could re-implement it without using apr, or use another back-end such as the memcache implementation. That was the plan, but the current target is ModSecurity version 3, where this feature is actually under development.

Contributor

marcstern commented Apr 28, 2016

Sure, but why does this bug occur? Is it a bug in sdbm, or in the way we use it?
In the former case we could check with the sdbm developers; in the latter, could we use it in another way?

Owner

zimmerle commented Apr 28, 2016

Hi @marcstern,

Would you mind formatting the text in your comment nicely? It is important to keep these issues readable, since they serve as a knowledge base for many users. I think GitHub allows you to edit your comment.

While editing the response, it would also be good if you could describe the circumstances in which you are facing this issue; we may be able to help by tuning your configuration a little bit.

Contributor

marcstern commented Apr 29, 2016

I have it on a dozen machines (on the others we suppressed the message).
All RedHat/CentOS 6.x with httpd 2.2 (the whole stack from the platform).
We run modsec-sdbm-util every 15 minutes.

Just ran into this issue on a test machine. Even after running modsec-sdbm-util with -k, the ip.pag file was still very large:

root@httpsbackend:~/modsec-sdbm-util# ls -lsh /var/run/modsecurity/data/*
   0 -rw-r----- 1 www-data www-data    0 Feb  5 10:05 /var/run/modsecurity/data/global.dir
   0 -rw-r----- 1 www-data www-data    0 Feb  5 10:05 /var/run/modsecurity/data/global.pag
 36K -rw-r----- 1 www-data www-data  60K Feb  5 10:09 /var/run/modsecurity/data/ip.dir
7.5M -rw-r----- 1 www-data www-data 472M Feb 10 05:10 /var/run/modsecurity/data/ip.pag
root@httpsbackend:~/modsec-sdbm-util# ./modsec-sdbm-util -k /var/run/modsecurity/data/ip.pag 
Checking target directory: /tmp
Target directory exists.
Opening file: /var/run/modsecurity/data/ip.pag
Database ready to be used.

Total of 3 elements processed.
3 elements removed.
Expired elements: 3, inconsistent items: 0
Fragmentation rate: 100.00% of the database is/was dirty data.
root@httpsbackend:~/modsec-sdbm-util# ls -lsh /var/run/modsecurity/data/*
   0 -rw-r----- 1 www-data www-data    0 Feb  5 10:05 /var/run/modsecurity/data/global.dir
   0 -rw-r----- 1 www-data www-data    0 Feb  5 10:05 /var/run/modsecurity/data/global.pag
 36K -rw-r----- 1 www-data www-data  60K Feb  5 10:09 /var/run/modsecurity/data/ip.dir
7.5M -rw-r----- 1 www-data www-data 472M Feb 11 03:51 /var/run/modsecurity/data/ip.pag
root@httpsbackend:~/modsec-sdbm-util# service apache2 restart
root@httpsbackend:~/modsec-sdbm-util# ls -lsh /var/run/modsecurity/data/*
   0 -rw-r----- 1 www-data www-data    0 Feb  5 10:05 /var/run/modsecurity/data/global.dir
   0 -rw-r----- 1 www-data www-data    0 Feb  5 10:05 /var/run/modsecurity/data/global.pag
 36K -rw-r----- 1 www-data www-data  60K Feb  5 10:09 /var/run/modsecurity/data/ip.dir
7.5M -rw-r----- 1 www-data www-data 472M Feb 11 03:51 /var/run/modsecurity/data/ip.pag
root@httpsbackend:~/modsec-sdbm-util# ./modsec-sdbm-util -k /var/run/modsecurity/data/ip.pag 
Checking target directory: /tmp
Target directory exists.
Opening file: /var/run/modsecurity/data/ip.pag
Database ready to be used.

Total of 0 elements processed.
0 elements removed.
Expired elements: 0, inconsistent items: 0
root@httpsbackend:~/modsec-sdbm-util# ls -lsh /var/run/modsecurity/data/*
   0 -rw-r----- 1 www-data www-data    0 Feb  5 10:05 /var/run/modsecurity/data/global.dir
   0 -rw-r----- 1 www-data www-data    0 Feb  5 10:05 /var/run/modsecurity/data/global.pag
 36K -rw-r----- 1 www-data www-data  60K Feb  5 10:09 /var/run/modsecurity/data/ip.dir
7.5M -rw-r----- 1 www-data www-data 472M Feb 11 03:51 /var/run/modsecurity/data/ip.pag
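Note the listings above: `ls -s` reports only 7.5M allocated while the apparent size stays at 472M, so the .pag file is sparse (which would also explain the earlier report of a 16GB file on an 8GB partition). A minimal shell illustration of the apparent-vs-allocated distinction (generic sparse-file demo, not ModSecurity-specific; the filename is a placeholder):

```shell
set -e
truncate -s 100M sparse.demo                  # apparent size: 100 MiB, nothing written
apparent=$(wc -c < sparse.demo)               # bytes that "ls -l" would report
allocated_kib=$(du -k sparse.demo | cut -f1)  # KiB actually allocated on disk (~0)
echo "apparent=${apparent} allocated_kib=${allocated_kib}"
rm -f sparse.demo
```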

This ticket is from 2013! My God, when will this be solved?

Contributor

bostrt commented Feb 15, 2017

Please see #576 (comment) Looks to be same issue and same solution.

Owner

zimmerle commented May 21, 2017

This is fixed by #1274.

@zimmerle zimmerle closed this May 21, 2017
