regexec.c: Fix failure to match problem
This bug has shown up only under EBCDIC so far, but could affect other
code.

Commit dcf88e3 fixed a bug in which a
macro parameter needed to be dereferenced.  Until then, the failure to
dereference meant that some code, which turned out to be faulty, was
effectively always skipped.  So that commit, while correct in and of
itself, exposed a pre-existing bug.

It was hard for me to believe at first that a change of simply adding a
missing '*' could have broken things this way.  But the clue was that
the only characters that were affected were the set of C1 controls, and
only when the target string being matched was in UTF-8, and only on
EBCDIC systems.  The difference between EBCDIC and ASCII platforms in
this regard is that the C1 controls under UTF-8 are represented by a
single byte on EBCDIC systems, and by two bytes on ASCII.  The test to
which the dereference was added looks for characters that are single
bytes both under UTF-8 and not, and hence gives different results on
EBCDIC and ASCII platforms for exactly the set of C1 controls.
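
The platform difference described above can be sketched in a few lines of
C.  This is an illustration only, not code from regexec.c: the helper
names are hypothetical, and the EBCDIC side rests on the assumption that
UTF-EBCDIC encodes every Unicode code point below 0xA0 as a single byte,
which matches the statement above that the C1 controls are single bytes
there.

```c
#include <stdbool.h>
#include <stdint.h>

/* Number of bytes standard (ASCII-platform) UTF-8 needs for a code point. */
static int utf8_length_ascii(uint32_t cp)
{
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Is the code point a single byte under UTF-8 on an ASCII platform? */
static bool is_single_byte_ascii(uint32_t cp)
{
    return utf8_length_ascii(cp) == 1;
}

/* ASSUMPTION: under UTF-EBCDIC, Unicode code points below 0xA0
 * (ASCII-range characters plus the C1 controls) are single bytes. */
static bool is_single_byte_ebcdic(uint32_t cp)
{
    return cp < 0xA0;
}

/* True exactly where the two kinds of platform give different answers. */
static bool platforms_disagree(uint32_t cp)
{
    return is_single_byte_ascii(cp) != is_single_byte_ebcdic(cp);
}
```

Under these assumptions, platforms_disagree() is true only for U+0080
through U+009F, i.e. the C1 controls, which is why the symptom was
confined to that set of characters.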

The code in question looks up an input code point to see if it is
matched by an ANYOF node, the kind generated for bracketed character
classes.  The first N code points are stored in a bit vector.  (N is
generally 256, but perl can be compiled to make that larger.)  If there
are no complications, the answer can be found directly by just looking
up the code point in the vector.  But if there are complications, a
function is called to sort them all out.  The macro looks for
complications, and calls the function if needed, but does the lookup
directly if not.  One of those complications is that the input needs to
be decoded to its actual code point value if the target is UTF-8 and the
code point isn't a single byte then.  After the dereference fix, the
caller of the macro correctly knew that this was a single byte, and so
was calling the macro.  But it turns out that the macro, as commented,
was expecting to be called only if the target was not UTF-8, and so
unconditionally told the function that the target wasn't UTF-8, and so
the function didn't work properly.

The solution is to simply call the function in the macros with the
correct value of whether the target string is UTF-8 or not.
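
As a rough illustration of the fast-path/slow-path pattern and of why the
UTF-8 flag has to be passed through, here is a much simplified sketch.
It is not the perl implementation: class_node, class_add, bitmap_test,
class_match_slow and CLASS_MATCH are hypothetical stand-ins, the bitmap is
fixed at 256 entries, and the slow path decodes only two-byte ASCII-style
UTF-8 sequences.

```c
#include <stdbool.h>
#include <stdint.h>

#define BITMAP_SIZE 256

/* Hypothetical stand-in for a character-class node. */
typedef struct {
    uint8_t flags;                      /* nonzero: "complications" exist */
    uint8_t bitmap[BITMAP_SIZE / 8];    /* one bit per low code point */
} class_node;

static void class_add(class_node *n, uint8_t cp)
{
    n->bitmap[cp >> 3] |= (uint8_t)(1u << (cp & 7));
}

static bool bitmap_test(const class_node *n, uint8_t cp)
{
    return (n->bitmap[cp >> 3] >> (cp & 7)) & 1;
}

/* Slow path: decodes the input first when the target is UTF-8.
 * If 'utf8' is wrong, the raw byte is looked up instead of the
 * decoded code point, and the answer can be wrong. */
static bool class_match_slow(const class_node *n, const uint8_t *p,
                             const uint8_t *end, bool utf8)
{
    uint32_t cp = *p;

    if (utf8 && (*p & 0xE0) == 0xC0 && p + 1 < end)
        cp = ((uint32_t)(*p & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);

    return cp < BITMAP_SIZE && bitmap_test(n, (uint8_t)cp);
}

/* Fast path when there are no complications; otherwise call the slow
 * path, passing the caller's UTF-8 flag instead of hard-coding 0.
 * Like the real macro, it is meant only for single-byte characters. */
#define CLASS_MATCH(n, p, u)                                   \
    ((n)->flags ? class_match_slow((n), (p), (p) + 1, (u))     \
                : bitmap_test((n), *(p)))
```

In this sketch, calling class_match_slow() on the bytes 0xC3 0xA9 with a
class containing code point 0xE9 succeeds only when the flag says the
target is UTF-8; with the flag wrongly false, the raw byte 0xC3 is looked
up instead.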
khwilliamson committed Mar 1, 2016
1 parent 2202622 commit 451c6e0
Showing 1 changed file with 9 additions and 8 deletions.
17 changes: 9 additions & 8 deletions regexec.c
@@ -107,11 +107,12 @@ static const char* const non_utf8_target_but_utf8_required
#define STATIC static
#endif

-/* Valid only for non-utf8 strings: avoids the reginclass
- * call if there are no complications: i.e., if everything matchable is
- * straight forward in the bitmap */
-#define REGINCLASS(prog,p,c) (ANYOF_FLAGS(p) ? reginclass(prog,p,c,c+1,0) \
-: ANYOF_BITMAP_TEST(p,*(c)))
+/* Valid only if 'c', the character being looke-up, is an invariant under
+ * UTF-8: it avoids the reginclass call if there are no complications: i.e., if
+ * everything matchable is straight forward in the bitmap */
+#define REGINCLASS(prog,p,c,u) (ANYOF_FLAGS(p) \
+? reginclass(prog,p,c,c+1,u) \
+: ANYOF_BITMAP_TEST(p,*(c)))

/*
* Forwards.
@@ -1864,7 +1865,7 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s,
reginclass(prog, c, (U8*)s, (U8*) strend, utf8_target));
}
else {
-REXEC_FBC_CLASS_SCAN(REGINCLASS(prog, c, (U8*)s));
+REXEC_FBC_CLASS_SCAN(REGINCLASS(prog, c, (U8*)s, 0));
}
break;

@@ -6118,7 +6119,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
locinput += UTF8SKIP(locinput);
}
else {
-if (!REGINCLASS(rex, scan, (U8*)locinput))
+if (!REGINCLASS(rex, scan, (U8*)locinput, utf8_target))
sayNO;
locinput++;
}
@@ -8664,7 +8665,7 @@ S_regrepeat(pTHX_ regexp *prog, char **startposp, const regnode *p,
hardcount++;
}
} else {
-while (scan < loceol && REGINCLASS(prog, p, (U8*)scan))
+while (scan < loceol && REGINCLASS(prog, p, (U8*)scan, 0))
scan++;
}
break;