Update regex_suffix.c #1095

RainRat · 2023-11-19T08:44:20Z

described in #1085

micahsnyder · 2023-11-21T00:53:08Z

libclamav/regex_suffix.c

@@ -246,7 +246,7 @@ static uint8_t *parse_char_class(const uint8_t *pat, size_t patSize, size_t *pos
            for (c = range_start + 1; c <= range_end; c++)
                bitmap[c >> 3] ^= 1 << (c & 0x7);
            hasprev = 0;
-        } else if (pat[*pos] == '[' && pat[*pos] == ':') {
+        } else if (pat[*pos] == '[' && pat[*pos + 1] == ':') {


I'm unsure if this is the correct fix, or if it was supposed to be || instead of &&. Were you able to identify based on your understanding of regex character classes what this is trying to do and which would be the correct fix? I'm hesitant to make either change without knowing what was originally intended. Ideally we'd be able to test the fix also makes something work that previously wouldn't work. E.g. if there is some regex pattern that wouldn't work before and would now work after this PR.

I'm making suggestions based on static code review so you are right to question them.

Searching for "[:" makes sense because there are predefined character classes in POSIX ( https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions#Character_classes ).

If using a regex with one of those character classes works with the fix but not without, that would show that the fix is correct.

The caller of this function is here in parse_regex():

case '[': ++*last; right = make_charclass(parse_char_class(p, pSize, last)); if (!right) { destroy_tree(v); return NULL; } v = make_node(concat, v, right); if (!v) { destroy_tree(right); return NULL; } ++*last; break;

So it has already found the leading [. So it would make sense if this were looking for something like [[:lower:]]

Note this regex parsing code is only used in the phishing signature regex parsing code for PDB and WDB signatures.

To test this, I made this PDB signature:

H:bankofamerica.com

to monitor the display domain name "bankofamerica.com" found in emails for suspicious real-URL's.
Then I made this WDB signature to test the regex features:

X:.+\.eba[[:digit:]]\.com:.+\.bankofamerica\.com

This silly signature will allow the real-URL to be eba5.com (or any other digit after eba), but anything other than that (such as ebay.com, or ebat.com) will not be ignored and will alert as suspicious.

I tested by scanning this email, with variations of the link:

From: "Somebody -X (username - Contractor at Example)" <username@example.com> To: "Your Name (Your Name)" <yourname@example.com> Subject: RE: Your Money Date: Fri, 12 Jun 2013 07:04:52 +0000 Content-Language: en-US Content-Type: multipart/alternative; boundary="_blah_" MIME-Version: 1.0 --_blah_ Content-Type: text/plain; charset="Windows-1252" Content-Transfer-Encoding: quoted-printable <html> <head> <meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3DWindows-1=252"> </head> <body> <br> Please click this link: <br> <a href="http://www.eba5.com">https://www.bankofamerica.com</a> <br> </body> </html> --_blah_--

I tested ClamAV 1.2.0 to see if using the [[:digit:]] character class works.

E.g. I ran commands like this:

./install/bin/clamscan -d ../test.pdb -d ../test.wdb ../test.eml Loading: 0s, ETA: 0s [========================>] 1/1 sigs Compiling: 0s, ETA: 0s [========================>] 10/10 tasks LibClamAV info: Suspicious link found! LibClamAV info: Real URL: http://www.ebay.com LibClamAV info: Display URL: https://www.bankofamerica.com /home/micah/workspace/clamav-micah-2/test.eml: Heuristics.Phishing.Email.SSL-Spoof FOUND

It appears to work already, though I didn't debug step through line by line to see how it works. But I find that strange, because I did test in a debugger with this PR and was able to confirm it working, and also running the "new" code:

In short, it's not clear to me if this code is really needed. Though it does seem to work okay with your change. But it also works okay like this:

hasprev = 0; // } else if (pat[*pos] == '[' && pat[*pos + 1] == ':') { // /* char class */ // FREE(bitmap); // while (pat[*pos] != ']') INC_POS(pos, patSize); // INC_POS(pos, patSize); // while (pat[*pos] != ']') INC_POS(pos, patSize); // return dot_bitmap; } else { bitmap[pat[*pos] >> 3] ^= 1 << (pat[*pos] & 0x7);

As a side note: testing character classes relating to upper/lower character changes will not work as expected because the text is all converted to lowercase before matching.

Update regex_suffix.c

ce1fa7f

described in #1085

micahsnyder reviewed Nov 21, 2023

View reviewed changes

RainRat closed this by deleting the head repository May 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update regex_suffix.c #1095

Update regex_suffix.c #1095

RainRat commented Nov 19, 2023

micahsnyder Nov 21, 2023

RainRat Nov 23, 2023

micahsnyder Dec 1, 2023

Update regex_suffix.c #1095

Update regex_suffix.c #1095

Conversation

RainRat commented Nov 19, 2023

micahsnyder Nov 21, 2023

Choose a reason for hiding this comment

RainRat Nov 23, 2023

Choose a reason for hiding this comment

micahsnyder Dec 1, 2023

Choose a reason for hiding this comment