Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Loading…

IP persistence storage seems to not clean up #574

Open
rcbarnett opened this Issue · 13 comments

5 participants

@rcbarnett
Owner

MODSEC-426: It seem that mod_security is not cleaning up the persistence storage for the IP-storage I am using.

Here is a simple test ruleset I am using (nothing else):

SecRule IP:dos_block "@eq 1" \
"phase:2,block,msg:'IP address blocked due to high connection rate.', \
id:2003,status:403,severity:'CRITICAL',tag:'DoS'"
SecRule SCRIPT_FILENAME "@rx /www/content/prod/dl/dl.php" \
"phase:5,chain,t:none,nolog,pass,id:2001,severity:'INFO',tag:'DoS',setvar:IP.dos_counter=+1,expirevar:IP.dos_counter=60"
SecRule IP:dos_counter "@gt 100" \
"t:none,setvar:IP.dos_block,setvar:!IP.dos_counter,expirevar:IP.dos_block=60"

The ruleset is working fine. But what I am noticing is, that the /var/tmp/ip.pag file is constantly growing. It is currently at >1GiB
size. When I do a "strings" on that file, I see entries like this:

__expire_KEY
1344192686
794.224.68.113_64d9d88c927c58359a3649f30a00e95070bbb8c0
TIMEOUT
3600
__key
794.224.68.113_64d9d88c927c58359a3649f30a00e95070bbb8c0
__name
CREATE_TIME
1344189037
UPDATE_COUNTER
dos_counter
__expire_dos_counter
1344189097
LAST_UPDATE_TIME
1344189086
94.224.68.113_64d9d88c927c58359a3649f30a00e95070bbb8c0
__expire_KEY
1343169406
868.167.225.250_77150c41806fda817314ba9c40e040c598830d5d
TIMEOUT
3600
__key
868.167.225.250_77150c41806fda817314ba9c40e040c598830d5d
__name
CREATE_TIME
1343165769
UPDATE_COUNTER
dos_counter
__expire_dos_counter
1343165829
LAST_UPDATE_TIME
1343165806
68.167.225.250_77150c41806fda817314ba9c40e040c598830dIR
__expire_KEY
1344005533
786.132.126.93_aa739e3aaaa1fbfc8667bb26120e9930df881a82
TIMEOUT
3600
__key
786.132.126.93_aa739e3aaaa1fbfc8667bb26120e9930df881a82
__name
[...]

Now when I convert the UNIX time strings to UTC, it shows me dates, which are already
expired since weeks. I would asume, that if an entry is expired, that the garbage collection
of mod_sec would remove the entry from the ip.pag file, to keep it small. Or am I wrong?

Any assistance would be highly appreciated.

@zimmerle zimmerle was assigned
@rcbarnett
Owner

Original reporter: wneessen

@rcbarnett
Owner

wneessen: (This is actually a clone of MODSEC-325)

Hi Breno,

I am sorry that I have to re-open this old ticket (or actually clone). After reading your response I didn't further investigate this issue and simply disabled mod_security on the specific host, which resolved our problems so far.
After almost a year now, I had to re-enable mod_security with the same ruleset again (hoping that the problem was gone given that there were couple of version upgrades... unfortunately it didn't.

Here is a short timeline of what happend and what I was able to capture:

Re-Enabled mod_security with the same ruleset on Tuesday. The rules worked as expected.
On wednesday I already noticed an increase in the /var/tmp/ip.pag file.
Thurday night, the apache got unresponsive for a period of at least 10 minutes (we switched to the backup server
to not lose traffic). The server-status page (which happend to load (very slowly) after a couple of minutes), showed
tons of "L" processes". While the mod-sec debug log was filled with lots of these:

[05/Sep/2013:21:47:09 +0000] [xxxxxxx.cleverbridge.com/sid#80171c530][rid#802dbe0a0][/XXXXXXXXX/xxxxxxx-xxxxxxx-xxxxxxx-xxxxxxx/xxxxxxx.pdf][1] collections_remove_stale: Failed deleting collection (name "ip", key "80.XXX.XXX.XXX_6e1c38356769095ad76267f9edb43ddae2af1491"): Internal error
[05/Sep/2013:21:47:09 +0000] [xxxxxxx.cleverbridge.com/sid#80171c530][rid#802dac0a0][/XXXXXXXXX/xxxxxxx-xxxxxxx-xxxxxxx-xxxxxxx/xxxxxxx.msi][1] collections_remove_stale: Failed deleting collection (name "ip", key "80.XXX.XXX.XXX_6e1c38356769095ad76267f9edb43ddae2af1491"): Internal error
[05/Sep/2013:21:50:47 +0000] [xxxxxxx.cleverbridge.com/sid#801739a50][rid#802d760a0][/server-status][1] collections_remove_stale: Failed deleting collection (name "ip", key "85.XXX.XXX.XXX_6e1c38356769095ad76267f9edb43ddae2af1491"): Internal error
[05/Sep/2013:21:50:47 +0000] [xxxxxxx.cleverbridge.com/sid#801739a50][rid#802d760a0][/server-status][1] collections_remove_stale: Failed deleting collection (name "ip", key "85.XXX.XXX.XXX_6e1c38356769095ad76267f9edb43ddae2af1491"): Internal error

The /var/tmp/ip.pag file grew to a size of 7GiB:

% ls -l /var/tmp/ip.pag :-/
-rw-r----- 1 www wheel 7560283136 Sep 6 12:05 /var/tmp/ip.pag

% du -A /var/tmp/ip.pag (FreeBSD userland "du" - values in kbytes)
7383089 /var/tmp/ip.pag

% gdu -b /var/tmp/ip.pag (GNU "du" on FreeBSD)
7560283136 /var/tmp/ip.pag

The SecCollectionTimeout was set to "600".

Though this is a download server, it is not hit that hard, that it should have a collection of 7 GiB of IP-Useragent combinations given
that the values should be overwritten after 10 minutes. Do you have any idea what is going on here? As said, the problem started
again after I enabled (the latest) mod_security version on this server. It was running w/o this problem for almost a year with mod_sec
disabled.

Thanks for you help.
Winfried

@quenenni

Hello,

We had the same unfortunate experience.
After 2 months and a half, the file ip.pag reached the size of 25Go (with "du -b" & "ls -l")
Our webserver became very unstable, with a load going from 5 to 60 rapidly then slowly going down (all the websites were very slow)

The solution was to erase ip.pag et let modsec create a new one.

In apache error.log file, we can see lots of:

[error] [client xxx.xxx.71.148] ModSecurity: Failed deleting collection (name "ip", key "83.43.238.96_befe6113014112e0700b42d9324b0f65e7f4bd6d"): Internal error [hostname "www.yyyyyyyy.org"] [uri "/zzzzzzzzzzzzzzzz"] [unique_id "UyXZ0k9jyzIAADnhARwAAAA-"]

I added in modsecurity.conf the config:

SecCollectionTimeout 600

but we still have the same errors.
I don't think this was the problem as the ref guide says that without that, SecCollectionTimeout has a default value of 3600

Server: Debian Wheezy
Apache: 2.2.22-13+deb7u1
ModSecurity:
libapache2-modsecurity 2.6.6-6+deb7u1
modsecurity-crs 2.2.5-2

@quenenni

Is there any way we can help debug that problem?
Our ip.pag file already grows beyond a reasonable size in less than a day.

The only info I found about it: "Looks like this error message is from a apr function." from Breno (seems to be the one in charge to find the bug) (http://marc.info/?l=mod-security-users&m=136683562925114&w=2)

@quenenni

The keys found in the error messages are NOT in the ip.pag file.

[error] [client xxx.xxx.71.148] ModSecurity: Failed deleting collection (name "ip", key "83.43.238.96_befe6113014112e0700b42d9324b0f65e7f4bd6d"): Internal error [hostname "www.yyyyyyyy.org"] [uri "/zzzzzzzzzzzzzzzz"] [unique_id "UyXZ0k9jyzIAADnhARwAAAA-"]

htdbm -l -Tsdbm /tmp/ip 2>&1 | grep "83.43.238.96_befe6113014112e0700b42d9324b0f65e7f4bd6d"

It seems the problem doesn't come from modsec not able to delete keys.

After a few checks, we can see that the number of keys in ip.pag file goes up and down.
So modsec is managing correctly the db.

From apache2/persist_dbm.c (line 218)

rc = apr_sdbm_delete(dbm, key);
        if (rc != APR_SUCCESS) {
            msr_log(msr, 1, "collection_retrieve_ex: Failed deleting collection (name \"%s\", "
                "key \"%s\"): %s", log_escape(msr->mp, col_name),
                log_escape_ex(msr->mp, col_key, col_key_len), get_apr_error(msr->mp, rc));
            msr->msc_sdbm_delete_error = 1;
            goto cleanup;
        }

And in apr-util-1.5.3, in dbm/sdbm/sdbm.c (line 230),

    if ((status = getpage(db, exhash(key))) == APR_SUCCESS) {
        if (!delpair(db->pagbuf, key))
            /* ### should we define some APRUTIL codes? */
            status = APR_EGENERAL;
        else
            status = write_page(db, db->pagbuf, db->pagbno);
    }

I'd say that the status code received in modsec comes from the call to getPage that is different from APR_SUCCESS as the key doesn't exist in the db.

We changed our modesc local exceptions to not log anymore (that represents a lot of requests) and set the SecCollectionTimeout directive to 600 (it wasn't set at all before but the manual says the default value is 3600) and since, the ip.pag file stays at a good size (110Mo for now)

Maybe you should just add more granularities to your tests and errors/warnings messages.

@zimmerle
Owner

Hi @quenenni, thanks.

Are you interested in provide your solution in the format of a pull request?

https://help.github.com/articles/using-pull-requests

@zimmerle
Owner

We have made an utility that may be useful for those who are using persistent collections in sdbm files. The utility allows the users to manipulate the sdbm files used by ModSecurity, including shrinking it, by remove the expired keys or malformed content.

The utility is available here:
https://github.com/SpiderLabs/modsec-sdbm-util

@quenenni, can you check if you still have malformed content by using this utility?

@quenenni

Sorry for the late answer.

I did some more testing by playing with the "SecCollectionTimeout" directive and by reducing the logging ('nolog' for lots of rules I did).

I can already say:
With SecCollectionTimeout 600 and lots of 'nolog', my ip.pag file grew up to 430Mo (6 days)
With SecCollectionTimeout 3600 and same 'nolog', my ip.pag file grew up to 922Mo (only since yesterday)

I'm going to keep that config a few days more and then I'll test by removing the "SecCollectionTimeout" directive to see if by default, it's indeed considered as 3600.

@zimmerle I'll gladly test you utilty, but, can you explain how to use it. I couldn't figure it out myself.

@zimmerle usually, I like to offer solutions, but in this case, I prefer not as I don't develop in c and I'm not sure I'll be able to do something good. Sorry.
But I'll help you as much as I can with debugging infos.

@Xavior82

First off, thanks to all that contributed to this thread, it helped put me on the right track.

Just thought I would add a note here, considering I have had the exact same issue as quenenni. Server load would spike from 2 to 20 in a seemingly random manner, and each time it appeared to be an httpd process that was causing the load spike. The disk IO was also excruciatingly slow throughout the day.

After a lot of debugging, we narrowed down the issue to mod_security (oddly enough, and largely still unexplained, the same mod_security rules are deployed on dozens of servers but only a few were exhibiting issues). When we disabled the rules, the load issues stopped. When we reactivated, the load problem returned.

Running an lsof -p on the apache process eventually led me to our now-infamous tmp/ip.pag storage file. That file alone had reached 16GB (on a partition that was otherwise limited to 8GB, so not sure I understand how that could even work).

Following quenenni's lead I emptied the file and the performance improvement was instant. The load dropped entirely.

I added the following parameter to my config:
SecCollectionTimeout 600

And restarted Apache.

So far after 30 minutes or so the ip.pag file has grown to 117MB, but performance is still optimal.

So my question is: does setting the SecCollectionTimeout properly limit the file size in the long run?

FYI I am running on CloudLinux 6 with Apache 2.4

@zimmerle
Owner

Hi @Xavior82,

That is a difficult question to answer because it depends on the load of your server. Usually reduce this time out is enough to have collections with "feasible" size. You can list the content of the "ip.pag" or even perform a out-of-band "shrink" using modsec-sdbm-util.

@tomchiverton

Is the memcache-based replacement ready yet, as the current implementation seems broken ?

@zimmerle
Owner

Hi @tomchiverton,

there is this branch:
https://github.com/SpiderLabs/ModSecurity/tree/memcache_collections

It is not ready for production yet, there was some issues reported, and so far no time to fix it yet. Did you tried this `memcache_collections' branch already?

@tomchiverton

No, I was just wondering how well it had been tested, since the current storage system is still causing issues at even our moderate scale (~10req/sec), and from the looks of this bug has been for others for a year.
We're sticky plastering it with a shorter collection timeout, and probably removing the file with a weekly cron job just in case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.