Some UTF-8 regular expression matches fail when read from file #15680
Comments
From @hiroshi-manabe
You can reproduce the bug with the following procedure:
This happens only when the string is read from a file handle and the second character is in the range \x{80}-\x{ff}. |
The RT System itself - Status changed from 'new' to 'open' |
From @hiroshi-manabe
On 2016-10-23 21:23:20, manabe.hiroshi@gmail.com wrote:
Sorry, the bug only reproduces when there is a set of parentheses, e.g. m{^(a|a\x{e4})$}. |
From @hiroshi-manabe
On 2016-10-23 21:44:35, manabe.hiroshi@gmail.com wrote:
Sorry again, the correct Unicode option for step 2 was -Ci. |
From @dcollinsn
On Sun Oct 23 21:48:55 2016, manabe.hiroshi@gmail.com wrote:
This seems interesting:
$ perl -Ci -e 'open IN, "<", "foo.txt";
And with -Dr...
dcollins@nightshade64:~/toolchain$ perl5.25.2-debug -D -Ci -e 'open IN, "<", "foo.txt";
EXECUTING...
matched
EXECUTING...
Matching REx "^(a\x{e4})$" against "a%x{e4}"
EXECUTING...
Matching REx "^(a|a\x{e4})$" against "a%x{e4}"
Unicode errors aside, is the TRIE optimization getting this wrong?
-- |
From @tonycoz
On Sun Oct 23 21:48:55 2016, manabe.hiroshi@gmail.com wrote:
The string doesn't need to be from a file:
$ ./perl -e '$_ = "a\xE4"; utf8::upgrade(
(blead perl)
The match is failing around line 5611 of regexec.c:
if ( trie->bitmap
At this point nextchr has the first byte of the UTF-8 encoded \xE4 (0xc3).
Tony |
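To make the observation about nextchr concrete, here is a minimal C sketch of why the first octet differs once the string is upgraded to UTF-8. The helpers `utf8_encode` and `first_octet` are hypothetical names written for this illustration; they are not Perl's internal functions, and the encoder does not reject surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out[] (room for 4 bytes).
 * Returns the number of bytes written.
 * Hypothetical helper for illustration, not Perl's encoder. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* First octet a byte-oriented matcher would see for cp, depending on
 * whether the target string is stored as UTF-8 or as single bytes. */
unsigned char first_octet(uint32_t cp, int is_utf8)
{
    unsigned char buf[4];
    if (!is_utf8) {
        assert(cp <= 0xFF);   /* only code points that fit in one byte */
        return (unsigned char)cp;
    }
    utf8_encode(cp, buf);
    return buf[0];
}
```

With these helpers, `first_octet(0xE4, 0)` is 0xE4 while `first_octet(0xE4, 1)` is 0xC3, which is the byte Tony reports in nextchr.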
From @iabyn
On Mon, Oct 24, 2016 at 03:57:15PM -0700, Tony Cook via RT wrote:
I'm looking into this as we speak. -- |
From @demerphq
On 25 October 2016 at 12:12, Dave Mitchell <davem@iabyn.com> wrote:
I was going to look into it later as well. Let me know how far you get.
We used to preload the bitmap with the first byte of the unicode
Let me know otherwise.
Yves
-- |
From @iabyn
On Tue, Oct 25, 2016 at 12:31:59PM +0200, demerphq wrote:
Not far, as it turns out.
The failing code has TRIE_BITMAP_TEST() returning
$_ = "a\x64";
doesn't.
-- |
From @demerphq
On 25 October 2016 at 13:45, Dave Mitchell <davem@iabyn.com> wrote:
Fixed. This ticket can be closed.
commit da42332
regcomp.c: fix perl #129950 - fix firstchar bitmap under utf8 with
The trie code contains a number of sub optimisations, one of which
The bitmap needs to contain the possible first octets of the string
So for instance in the pattern (?:a|a\x{E4}) we should restructure this
Yves
-- |
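The requirement that the bitmap hold "the possible first octets" can be sketched in C as follows. This is not the code from commit da42332. It is an illustration under the assumption that, for a code point in the range 0x80-0xFF, both the raw byte and the UTF-8 lead byte must be registered. The type `octet_bitmap` and the `bitmap_*` functions are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 256-bit set: one bit per possible first octet.
 * Hypothetical type for illustration, not Perl's trie bitmap. */
typedef struct {
    uint8_t bits[32];
} octet_bitmap;

void bitmap_clear(octet_bitmap *bm)
{
    memset(bm->bits, 0, sizeof bm->bits);
}

void bitmap_set(octet_bitmap *bm, unsigned char octet)
{
    bm->bits[octet >> 3] |= (uint8_t)(1u << (octet & 7));
}

int bitmap_test(const octet_bitmap *bm, unsigned char octet)
{
    return (bm->bits[octet >> 3] >> (octet & 7)) & 1;
}

/* Register every octet that can begin code point cp (cp <= 0xFF):
 * the raw byte, as seen in a non-UTF-8 string, and, for cp >= 0x80,
 * the lead byte of its two-byte UTF-8 encoding. */
void bitmap_add_first_octets(octet_bitmap *bm, uint32_t cp)
{
    assert(cp <= 0xFF);
    bitmap_set(bm, (unsigned char)cp);
    if (cp >= 0x80)
        bitmap_set(bm, (unsigned char)(0xC0 | (cp >> 6)));
}
```

A bitmap built with only `bitmap_set(&bm, 0xE4)` rejects the octet 0xC3, which matches the failure seen on an upgraded string. One built with `bitmap_add_first_octets(&bm, 0xE4)` accepts both 0xE4 and 0xC3.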
@iabyn - Status changed from 'open' to 'pending release' |
From @khwilliamson
Thank you for filing this report. You have helped make Perl better.
With the release today of Perl 5.26.0, this and 210 other issues have been
Perl 5.26.0 may be downloaded via:
If you find that the problem persists, feel free to reopen this ticket. |
@khwilliamson - Status changed from 'pending release' to 'resolved' |
Migrated from rt.perl.org#129950 (status was 'resolved')
Searchable as RT129950$