regcomp.c: fix perl #129950 - fix firstchar bitmap under utf8 with prefix optimisation

The trie code contains a number of sub-optimisations, one of which
extracts common prefixes from alternations, and another which is a
bitmap of the possible matching first chars.

The bitmap needs to contain the possible first octets of the string
which the trie can match, and for codepoints which might have a different
first octet under utf8 or non-utf8 we need to register BOTH octets.

So for instance in the pattern (?:a|a\x{E4}) we should restructure this
as a(|\x{E4}), and the bitmap for the trie should contain both \x{E4} AND
\x{C3}, as \x{C3} is the first byte of \x{E4} expressed as utf8.
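
The rule above can be illustrated with a small standalone sketch. This is
not the regcomp.c code: the bitmap type and every function name below are
hypothetical, and it assumes an ASCII platform where the invariant
codepoints are exactly those below 0x80. It only shows why registering
\x{E4} should also set the bit for \x{C3}.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 256-bit bitmap, one bit per possible first octet.
 * Illustration only; this is not perl's trie structure. */
typedef struct {
    uint8_t bits[32];
} first_octet_bitmap;

static void bitmap_clear(first_octet_bitmap *bm)
{
    memset(bm->bits, 0, sizeof bm->bits);
}

static void bitmap_set(first_octet_bitmap *bm, uint8_t octet)
{
    bm->bits[octet >> 3] |= (uint8_t)(1u << (octet & 7));
}

static int bitmap_test(const first_octet_bitmap *bm, uint8_t octet)
{
    return (bm->bits[octet >> 3] >> (octet & 7)) & 1;
}

/* Lead byte of the two-byte UTF-8 encoding of a codepoint in
 * 0x80..0xFF: the bit pattern 110xxxxx, where xxxxx holds the
 * codepoint's bits above the low six. */
static uint8_t utf8_two_byte_hi(uint8_t cp)
{
    return (uint8_t)(0xC0 | (cp >> 6));
}

/* Register a possible first char of a non-utf8 pattern. Codepoints of
 * 0x80 and above (assumed here to be the variant ones) start with a
 * different octet once the target string is utf8, so both octets are
 * recorded. */
static void register_first_char(first_octet_bitmap *bm, uint8_t cp)
{
    bitmap_set(bm, cp);
    if (cp >= 0x80)
        bitmap_set(bm, utf8_two_byte_hi(cp));
}
```

Calling bitmap_clear() on a bitmap and then register_first_char() with
0xE4 sets the bits for both 0xE4 and 0xC3, so a first-octet check passes
whether the target string is latin1 or utf8.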
demerphq committed Oct 27, 2016
1 parent fd609c8 commit da42332
Showing 2 changed files with 23 additions and 2 deletions.
14 changes: 14 additions & 0 deletions regcomp.c
@@ -3264,6 +3264,13 @@ S_make_trie(pTHX_ RExC_state_t *pRExC_state, regnode *startbranch,
TRIE_BITMAP_SET(trie,*ch);
if ( folder )
TRIE_BITMAP_SET(trie, folder[ *ch ]);
+ if ( !UTF ) {
+ /* store first byte of utf8 representation of
+ variant codepoints */
+ if (! UVCHR_IS_INVARIANT(*ch)) {
+ TRIE_BITMAP_SET(trie, UTF8_TWO_BYTE_HI(*ch));
+ }
+ }
DEBUG_OPTIMISE_r(
Perl_re_printf( aTHX_ "%s", (char*)ch)
);
@@ -3272,6 +3279,13 @@ S_make_trie(pTHX_ RExC_state_t *pRExC_state, regnode *startbranch,
TRIE_BITMAP_SET(trie,*ch);
if ( folder )
TRIE_BITMAP_SET(trie,folder[ *ch ]);
+ if ( !UTF ) {
+ /* store first byte of utf8 representation of
+ variant codepoints */
+ if (! UVCHR_IS_INVARIANT(*ch)) {
+ TRIE_BITMAP_SET(trie, UTF8_TWO_BYTE_HI(*ch));
+ }
+ }
DEBUG_OPTIMISE_r(Perl_re_printf( aTHX_ "%s", ch));
}
idx = ofs;
11 changes: 9 additions & 2 deletions t/re/pat.t
@@ -23,7 +23,7 @@ BEGIN {
skip_all('no re module') unless defined &DynaLoader::boot_DynaLoader;
skip_all_without_unicode_tables();

- plan tests => 800; # Update this when adding/deleting tests.
+ plan tests => 802; # Update this when adding/deleting tests.

run_tests() unless caller;

@@ -1799,7 +1799,14 @@ EOP
TODO: {
local $::TODO = "RT #21491: m'' interpolates escape sequences";
is(0+("\n" =~ m'\n'), 0, q|RT #21491: m'\n' should not interpolate|);
- }
+ }
+
+ {
+ my $str = "a\xE4";
+ ok( $str =~ m{^(a|a\x{e4})$}, "fix [perl #129950] - latin1 case" );
+ utf8::upgrade($str);
+ ok( $str =~ m{^(a|a\x{e4})$}, "fix [perl #129950] - utf8 case" );
+ }
} # End of sub run_tests

1;
