regexec.c: Fix failure to match problem

This bug has shown up only under EBCDIC so far, but could affect other code.

Commit dcf88e3 fixed a bug in which a macro parameter needed to be dereferenced. Until then, the failure to dereference meant that some code, which turned out to be faulty, was effectively always skipped. So that commit, while correct in and of itself, exposed a pre-existing bug.

It was hard for me to believe at first that simply adding a missing '*' could have broken things this way. But the clue was that the only characters affected were the C1 controls, and only when the target string being matched was in UTF-8, and only on EBCDIC systems. The difference between EBCDIC and ASCII platforms in this regard is that the C1 controls under UTF-8 are represented by a single byte on EBCDIC systems, and by two bytes on ASCII ones. The test to which the dereference was added looks for characters that are a single byte whether or not the string is in UTF-8, and hence gives different results on EBCDIC and ASCII platforms for exactly the set of C1 controls.

The code in question looks up an input code point to see if it is matched by an ANYOF node, the kind generated for bracketed character classes. The first N code points are stored in a bit vector. (N is generally 256, but perl can be compiled to make that larger.) If there are no complications, the answer can be found directly by just looking up the code point in the vector. But if there are complications, a function is called to sort them all out. The macro looks for complications, and calls the function if needed, but does the lookup directly if not. One of those complications is that the input needs to be decoded to its actual code point value if the target is UTF-8 and the code point isn't a single byte in that encoding.

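The platform difference described above (C1 controls being one byte under UTF-8 on EBCDIC but two bytes on ASCII) can be illustrated with a minimal sketch. It is not code from regexec.c. The helper names are hypothetical, and it works in terms of Unicode code points rather than native EBCDIC byte values, assuming only that UTF-8 encodes code points below 0x80 in one byte and UTF-EBCDIC encodes code points below 0xA0 in one byte.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: is code point cp encoded as a single byte in
 * UTF-8, the encoding used on ASCII platforms? */
bool single_byte_in_utf8(unsigned cp)
{
    return cp < 0x80;
}

/* Hypothetical helper: is code point cp encoded as a single byte in
 * UTF-EBCDIC, the encoding used on EBCDIC platforms? */
bool single_byte_in_utf_ebcdic(unsigned cp)
{
    return cp < 0xA0;
}

/* Count the code points below 256 for which the two helpers disagree,
 * and report the first and last such code point. */
int count_mismatches(unsigned *first, unsigned *last)
{
    int n = 0;
    for (unsigned cp = 0; cp < 256; cp++) {
        if (single_byte_in_utf8(cp) != single_byte_in_utf_ebcdic(cp)) {
            if (n == 0)
                *first = cp;
            *last = cp;
            n++;
        }
    }
    return n;
}

/* Self-check: the disagreements are exactly the 32 C1 controls. */
void check_c1_controls(void)
{
    unsigned first = 0, last = 0;
    assert(count_mismatches(&first, &last) == 32);
    assert(first == 0x80 && last == 0x9F);
}
```

Under these assumptions a "single byte" test gives different answers on the two kinds of platform only for 0x80 through 0x9F, which is why the symptom was confined to the C1 controls.
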
After the dereference fix, the caller of the macro correctly knew that this was a single byte, and so was calling the macro. But it turns out that the macro, as commented, was expecting to be called only if the target was not UTF-8, and so unconditionally told the function that it wasn't UTF-8, and so the function didn't work properly.

The solution is simply to call the function in the macro with the correct value of whether the target string is UTF-8 or not.
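
The shape of the problem, a macro that takes a fast bitmap path when there are no complications and otherwise calls a function with a hard-coded "not UTF-8" argument, can be sketched with a toy model. Everything here is hypothetical: the `toy_` names, the struct layout, and in particular the invented rule that one extra byte matches only in UTF-8 targets. The commit message does not say which part of the real function misbehaved, so that rule exists only to make the flag observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_HAS_UTF8_ONLY 0x1u   /* invented "complication" flag */

/* Toy stand-in for an ANYOF node. */
typedef struct {
    uint8_t  bitmap[32];      /* one bit per byte value 0..255 */
    unsigned flags;           /* non-zero means "complications" */
    uint8_t  utf8_only_byte;  /* invented: matches only in UTF-8 targets */
} toy_anyof;

/* Fast path: look the byte up directly in the bit vector. */
bool toy_bitmap_test(const toy_anyof *n, uint8_t byte)
{
    return (n->bitmap[byte >> 3] >> (byte & 7)) & 1u;
}

/* Slow path: its answer can depend on whether the target is UTF-8. */
bool toy_reginclass(const toy_anyof *n, const uint8_t *p, bool utf8)
{
    if (toy_bitmap_test(n, *p))
        return true;
    return utf8 && (n->flags & TOY_HAS_UTF8_ONLY) && *p == n->utf8_only_byte;
}

/* Old shape: always tells the slow path the target is not UTF-8. */
#define TOY_REGINCLASS_OLD(n, p) \
    ((n)->flags ? toy_reginclass(n, p, false) : toy_bitmap_test(n, *(p)))

/* New shape: the caller passes the real UTF-8-ness through. */
#define TOY_REGINCLASS_NEW(n, p, u) \
    ((n)->flags ? toy_reginclass(n, p, u) : toy_bitmap_test(n, *(p)))

/* Build a node with an empty bitmap and the invented complication. */
toy_anyof toy_demo_node(void)
{
    toy_anyof n = { {0}, TOY_HAS_UTF8_ONLY, 0x41 };
    return n;
}

/* Self-check: the old macro misses a match that the new one finds. */
void toy_self_check(void)
{
    toy_anyof n = toy_demo_node();
    uint8_t c = 0x41;
    assert(!TOY_REGINCLASS_OLD(&n, &c));
    assert(TOY_REGINCLASS_NEW(&n, &c, true));
}
```

In this toy, the old macro fails to match whenever the slow path needs to know that the target is UTF-8, while the new macro agrees with a direct call to the function.
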
Perl · Mar 1, 2016 · commit 451c6e0 · 1 parent 2202622
Showing 1 changed file with 9 additions and 8 deletions.
diff --git a/regexec.c b/regexec.c
@@ -107,11 +107,12 @@ static const char* const non_utf8_target_but_utf8_required
 #define	STATIC	static
 #endif
 
-/* Valid only for non-utf8 strings: avoids the reginclass
- * call if there are no complications: i.e., if everything matchable is
- * straight forward in the bitmap */
-#define REGINCLASS(prog,p,c)  (ANYOF_FLAGS(p) ? reginclass(prog,p,c,c+1,0)   \
-					      : ANYOF_BITMAP_TEST(p,*(c)))
+/* Valid only if 'c', the character being looked-up, is an invariant under
+ * UTF-8: it avoids the reginclass call if there are no complications: i.e., if
+ * everything matchable is straight forward in the bitmap */
+#define REGINCLASS(prog,p,c,u)  (ANYOF_FLAGS(p)                             \
+                                ? reginclass(prog,p,c,c+1,u)                \
+                                : ANYOF_BITMAP_TEST(p,*(c)))
 
 /*
  * Forwards.
@@ -1864,7 +1865,7 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s,
                       reginclass(prog, c, (U8*)s, (U8*) strend, utf8_target));
         }
         else {
-            REXEC_FBC_CLASS_SCAN(REGINCLASS(prog, c, (U8*)s));
+            REXEC_FBC_CLASS_SCAN(REGINCLASS(prog, c, (U8*)s, 0));
         }
         break;
 
@@ -6118,7 +6119,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
 		locinput += UTF8SKIP(locinput);
 	    }
 	    else {
-		if (!REGINCLASS(rex, scan, (U8*)locinput))
+		if (!REGINCLASS(rex, scan, (U8*)locinput, utf8_target))
 		    sayNO;
 		locinput++;
 	    }
@@ -8664,7 +8665,7 @@ S_regrepeat(pTHX_ regexp *prog, char **startposp, const regnode *p,
 		hardcount++;
 	    }
 	} else {
-	    while (scan < loceol && REGINCLASS(prog, p, (U8*)scan))
+	    while (scan < loceol && REGINCLASS(prog, p, (U8*)scan, 0))
 		scan++;
 	}
 	break;