Certain regex patterns cause fatal errors with valid UTF-8 #10434
Comments
From @benkasminbullock: This is a bug report for perl from benkasminbullock@gmail.com, The following script run on Cygwin prints out an error message Malformed UTF-8 character (fatal) at ./wwwjdicbug.pl line 75. However, the UTF-8 character which is claimed to be malformed comes ######### wwwjdicbug.pl #! perl package WWWJDIC; my %mirrors = ( sub new # Parse a page of results from WWWJDIC sub parse_results my @labels = $tree->look_down ('_tag', 'label'); sub lookup_url sub lookup sub lookup_kanji } 1; package main; my $wwwjdic = WWWJDIC::new(mirror => 'japan'); #### Output of ./wwwjdicbug.pl > bug.txt 2>&1 Looking up �$BAk8}�(B in WWWJDIC: <br> ############## End--- Site configuration information for perl 5.10.0: Configured by rurban at Mon Jun 30 16:03:19 GMT 2008. Summary of my perl5 (revision 5 version 10 subversion 0 patch 34065) Locally applied patches: @INC for perl 5.10.0: Environment for perl 5.10.0: |
From @benkasminbullock: This is a very much simplified version of the script which tripped the |
@benkasminbullock - Status changed from 'new' to 'open' |
From hector@debian.org (created by hector@debian.org): Executing this (which works correctly on perl 5.8) gives an error: #!/usr/bin/perl -w use utf8; my $p = 'á d</p>'; print "$p\n"; if ( hector@baloo:/tmp$ ./kk.pl The script fails for any utf8 definition of $p. This regression has also been tested on a vanilla perl compilation on another server. Perl Info
|
From @ikegami: On Mon, Mar 22, 2010 at 6:13 AM, Hector Garcia <perlbug-followup@perl.org> wrote:
Thanks for the report. Workaround until this is fixed: if ( Note that I removed the /g. "if (/.../g)" rarely makes any sense and can |
The RT System itself - Status changed from 'new' to 'open' |
From @khwilliamson: Eric Brine wrote:
I wonder if this is related to #46563: g suffix on string search which is a won't fix |
From @ikegami: On Mon, Mar 22, 2010 at 11:47 PM, karl williamson
The /g is not germane to the bug. The workaround wasn't the removal of the |
From @nwc10: On Mon, Mar 22, 2010 at 09:47:07PM -0600, karl williamson wrote:
http://rt.perl.org/rt3/Ticket/Display.html?id=46563 For now and for older perls this bug is firmly in the "wont fix" It wasn't yet described as a "won't" fix if it's still in current blead. Nicholas Clark |
From @khwilliamson: Nicholas Clark wrote:
I just tried it, and it is still a bug in 5.12RC0. |
From rivero@raulrivero.es: On Mon. Mar. 22 20:47:43 2010, public@khwilliamson.com wrote:
The /g isn't the problem: #!/usr/bin/perl -w use utf8; my $p = 'á d</p>'; print "$p\n"; if ( $ perl problem.pl And "m#(?:|(?!)\x{2660})(.*?)[-]?EFE\s*</p>$#sm" isn't a real If we change the (.*) and we use (\X*), it works. So, we think there is We could fix it with this patch:
Inline Patch
--- regcomp.c.OLD 2010-03-24 10:15:59.381767760 +0100
+++ regcomp.c 2010-03-24 10:17:03.068877134 +0100
@@ -6932,7 +6932,7 @@
ret = reg_node(pRExC_state, SANY);
else
ret = reg_node(pRExC_state, REG_ANY);
- *flagp |= HASWIDTH|SIMPLE;
+ *flagp |= HASWIDTH;
RExC_naughty++;
Set_Node_Length(ret, 1); /* MJD */
break;
Cheers, |
From hector@debian.org: This bug has nothing to do with bug 46563. Thanks |
From @iabyn: On Tue, Mar 23, 2010 at 02:58:58PM -0600, karl williamson wrote:
And here is a minimal(ish) case that triggers a 'Malformed UTF-8 $_ = "\x{e1} d</p>\x{100}"; -- |
From doug@ablegrape.com (created by doug@ablegrape.com): This is a bug report for perl from doug@ablegrape.com, ----------------------------------------------------------------- Now it dies under 5.8.9, 5.10.0 and 5.12.1, with "Malformed UTF-8 character (fatal)" - but the input data is the same, and is, as far as I can tell, perfectly valid UTF-8. I've isolated the failure to a test case, included here, which shows a simple expression that works, two (very) slightly more complex expressions that fail, and the original complex expression from my code. As far as I can tell, all of these should work. Oddly, if I add "use encoding 'utf8'" even the simple regex fails. My best guess is that perhaps for some reason the regex engine is backing up by bytes within my string, and starting in the middle of a character. The string itself is perfectly valid. #!/usr/bin/perl use strict vars; my $e = "Böck"; if (utf8::is_utf8($e)) { print "yep, is UTF8: $e\n"; } # this succeeds (failed before with use encoding 'utf8', unknown why) # these die # the original, full expression. Perl Info
|
From bitcard@candiru.com: FYI, discussion of this bug on Perlmonks: http://www.perlmonks.org/?node_id=843208 |
From @cowens: As a workaround, I suggest you use the \x{} literal escape: my $e = "B\x{f6}ck"; It seems to work on my OS X machines. On Fri, Jun 11, 2010 at 15:15, Doug Cook <perlbug-followup@perl.org> wrote:
-- |
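For readers who want to see what the `\x{f6}` literal in the workaround produces, here is a minimal sketch. It only inspects the string's internal storage with the core `utf8::is_utf8` and `utf8::upgrade` functions; it does not reproduce the regex failure, and the variable names other than `$e` are illustrative, not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The workaround literal from the comment above: U+00F6 written as an escape.
my $e = "B\x{f6}ck";

# utf8::is_utf8 reports whether perl stores the string with its internal
# UTF-8 flag set. It says nothing about whether the text is "valid".
my $flag_before = utf8::is_utf8($e) ? 1 : 0;

# utf8::upgrade switches the internal storage to UTF-8 without changing
# the characters the string contains. ($upgraded is an illustrative name.)
my $upgraded = $e;
utf8::upgrade($upgraded);
my $flag_after = utf8::is_utf8($upgraded) ? 1 : 0;

# Same characters and same character length, different internal storage.
my $same_text = ($e eq $upgraded) ? 1 : 0;

printf "flag before upgrade: %d\n", $flag_before;
printf "flag after upgrade:  %d\n", $flag_after;
printf "strings compare equal: %d, length %d\n", $same_text, length $upgraded;
```

Running the same regex against both `$e` and `$upgraded` on an affected perl would be one way to check whether the internal representation matters for a given test case.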
The RT System itself - Status changed from 'new' to 'open' |
From @khwilliamson: Chas. Owens wrote:
Unfortunately the reason this workaround works is because it avoids
|
From @druud62: Doug Cook wrote:
It could well be that your editor saves the source as either UTF-8 or -- |
From @tsee: According to Yves, this was fixed by commit v5.13.4-25-g92f3d48. --Steffen |
@tsee - Status changed from 'open' to 'resolved' |
From @cpansprout: This appears to have been fixed. It may be the same bug as #75680. |
From @cpansprout: On Sun Sep 05 14:52:42 2010, sprout wrote:
Yes, it is the same. I’m marking this as resolved. |
@cpansprout - Status changed from 'open' to 'resolved' |
From @cpansprout: On Tue Jul 29 19:46:08 2008, BKB wrote:
Thank you for your report. You have ‘use utf8’ in your script, which signals to perl that your But then you have a string containing the octets 95 B6, which is not You do not need ‘use utf8’ if you are just *using* Unicode strings. |
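The point about the octets 95 B6 can be checked mechanically. The sketch below uses the core `Encode` module to test whether an octet sequence is well-formed UTF-8; `is_well_formed_utf8` is a hypothetical helper written for this illustration, not something from the thread or from perl itself. The byte 0x95 is a UTF-8 continuation byte, so it cannot begin a character.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper (not from the thread): returns 1 if the octets are
# well-formed UTF-8, 0 otherwise.
sub is_well_formed_utf8 {
    my ($octets) = @_;
    my $copy = $octets;    # decode() with a CHECK value may modify its input
    return eval { decode('UTF-8', $copy, FB_CROAK); 1 } ? 1 : 0;
}

my $reported = "\x95\xB6";    # the octets named in the comment above
my $o_umlaut = "\xC3\xB6";    # U+00F6 encoded as UTF-8, for contrast

my $reported_ok = is_well_formed_utf8($reported);
my $o_umlaut_ok = is_well_formed_utf8($o_umlaut);

printf "95 B6 well-formed: %d\n", $reported_ok;
printf "C3 B6 well-formed: %d\n", $o_umlaut_ok;
```

A check like this on the raw bytes of a source file or input string helps separate genuinely malformed data from the regex engine bug discussed in the rest of the thread.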
@cpansprout - Status changed from 'open' to 'rejected' |
From @benkasminbullock: I'm pretty sure I filed a very much simpler example of this bug after I don't think there was anything wrong with the utf8 etc., that is On 20 September 2010 05:48, Father Chrysostomos via RT
|
From @cpansprout: On Sun Sep 19 21:21:17 2010, BKB wrote:
I only looked at your reduced case at first. It was failing for the Your original script can be reduced to: perl -le' "(n) (See \x{7a93}\x{8ca9}) over the counter sales (often of It is the same as 75680 and 73732, which were fixed by |
@cpansprout - Status changed from 'rejected' to 'resolved' |
From @khwilliamson: Father Chrysostomos via RT wrote:
And this fix made it into 5.12.2, which is now an official Perl release |
Migrated from rt.perl.org#75680 (status was 'resolved')
Searchable as RT75680$