regexp very slow on UTF8 (over 100.000 times slower than without UTF8) #9774
From Andreas.Hansson@axis.com
Created by andhan@axis.com
A regexp on >32 M chars takes 0 seconds without UTF8
Test with code below:
#!/usr/bin/perl -w
my $load = "Abra ka dabra\n ----- /var/log/messages -----\n";
# TEST 1
# TEST 2
sub _parse($)
if ($history =~ / (?:.*)------/s)
}
Perl Info
|
From p5p@spam.wizbit.be
On Mon Jun 22 04:51:07 2009, Andreas.Hansson@axis.com wrote:
#!/usr/bin/perl -w
my $load = "Abra ka dabra\n ----- /var/log/messages -----\n";
# TEST 1
# TEST 2
sub _parse {
if ($history =~ / (?:.*)------/s)
Behaviour is the same on all versions of perl;
$ perl-5.10.0 rt-66852.pl
Running with a shorter string and re 'debug':
#!/usr/bin/perl -w
use re 'debug';
# TEST 1
# TEST 2
sub _parse {
if ($history =~ / (?:.*)------/s)
$ perl-5.10.0 rt-66852.pl
Compiling REx " (?:.*)------"
testing with utf8=OFF...
testing with utf8=ON...
When the data is non UTF8 then the match is immediately rejected by the
When the data is in UTF8 then the match is not rejected by the
Best regards, Bram |
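The original test script is truncated in this thread, so the following is only a minimal reproduction sketch under stated assumptions: the repeat count, the `time_match` helper, and the use of `utf8::upgrade` to turn the internal UTF8 flag on are all illustrative choices, not the reporter's code. Note that the test string as shown contains runs of only five dashes while the pattern requires six, so the match fails in both cases; the interesting part is how long the failure takes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Assumption: a modest repeat count so the script finishes quickly even on
# an affected perl. Raise it to make any timing difference more visible.
my $repeat = 200;
my $load   = "Abra ka dabra\n ----- /var/log/messages -----\n" x $repeat;

# Hypothetical helper: returns (matched?, elapsed seconds) for the pattern
# quoted in the report.
sub time_match {
    my ($history) = @_;
    my $start   = time;
    my $matched = ($history =~ / (?:.*)------/s) ? 1 : 0;
    return ($matched, time - $start);
}

my $bytes = $load;        # internal UTF8 flag off
my $chars = $load;
utf8::upgrade($chars);    # same characters, internal UTF8 flag on

for my $case ([ 'utf8=OFF', $bytes ], [ 'utf8=ON', $chars ]) {
    my ($label, $string) = @$case;
    my ($matched, $elapsed) = time_match($string);
    printf "%s: matched=%d, %.6f s\n", $label, $matched, $elapsed;
}
```

Running the script under different perl versions and comparing the two printed timings should show whether a given build is affected.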
The RT System itself - Status changed from 'new' to 'open' |
From @davidnicol
On Tue, Jun 23, 2009 at 4:02 PM, Bram via RT <perlbug-followup@perl.org> wrote:
So you can optimize it yourself at the Perl level:
$ diff -U5 66852_orig.pl 66852.pl
--- 66852_orig.pl 2009-06-23 22:22:32.000000000 -0500
+++ 66852.pl 2009-06-23 22:23:22.000000000 -0500
@@ -25,11 +25,12 @@
sub _parse {
my ($history) = @_;
- if ($history =~ / (?:.*)------/s)
+ if ($history =~ /------/
+ and $history =~ / (?:.*)------/s)
{
print "Match\n";
}
}
__END__
$ perl 66852.pl |
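As a standalone illustration of the workaround in the diff above, here is a small sketch that wraps the two-step test in a function. The name `has_marker` and the sample inputs are hypothetical; the two patterns are the ones from the patch. The cheap literal test runs first, so the pattern with `.*` is only tried on strings that contain six consecutive dashes somewhere.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the guarded match from the patch: the plain
# literal check short-circuits before the pattern containing .* is tried.
sub has_marker {
    my ($history) = @_;
    return ( $history =~ /------/
         and $history =~ / (?:.*)------/s ) ? 1 : 0;
}

for my $sample ("x ------", "a -----b", "------ ") {
    printf "%-12s => %d\n", "'$sample'", has_marker($sample);
}
```

The guard does not change which strings match, since any string matched by the second pattern necessarily contains the literal; it only avoids the slow path when the literal is absent.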
From Andreas.Hansson@axis.com
Thank you for your investigation.
For information, I tested the same code with PHP5, and it seems that PHP's mb_ereg also lacks performance...
Test-result:
Code:
$load = "Abra ka dabra\n ----- /var/log/messages -----\n";
// TEST 1
// TEST 2
function _parse_iso8859($history) {
function _parse_utf8($history) {
?>
|
From @demerphq
2009/6/24 Andreas Hansson <Andreas.Hansson@axis.com>:
Just thinking aloud for the sake of the ticket, but I think one might
Why should utf8 affect case-insensitive matching? All one has to do is
So, er, why don't we do that, I wonder...
Yves -- |
From perl@nevcal.com
On approximately 6/24/2009 9:28 AM, came the following characters from
I've wondered that for a long time. The arguments I've heard include:
1) Compatibility... upgrading the user's string might break something
2) Performance... updating the user's string to a temporary (to avoid
3) Not all strings are text; some may even be combinations of binary and
I may have missed some.
--
|
From @demerphq
2009/6/24 Glenn Linderman <perl@nevcal.com>:
Actually, my point was slightly different here. The strings I'm talking
So for instance the pattern in the two cases we are discussing here is
Or am I missing a subtlety here that you covered?
cheers, -- |
From @khwilliamson
demerphq wrote:
I don't know if this is relevant, but I recollect when looking at the |
From perl@nevcal.com
On approximately 6/24/2009 12:28 PM, came the following characters from
No, I overlooked the subtlety that you were discussing!
For internal strings stored and used by the regex engine in the compiled
So under the assumption that the strings are short, and are internal,
1) match against non-utf8 strings
So then there are two sub-cases:
a) the internal string has the same representation in utf8 and non-utf8
regex was initially designed for 1a only, I presume. When utf8 came
Seems like for cases 1a, 2a, and 3a, the match should be done as octets.
To handle the b subcases is more complex. Either there has to be logic
Of course, handling things like case-insensitivity couldn't be done that
Should I guess that the current reason is that distance in characters
--
|
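Sub-case (a) above, where the literal has the same representation in utf8 and non-utf8, can be illustrated at the Perl level. This is only a sketch of the idea, not how the regex engine implements it: `is_ascii_only` and `octet_contains` are hypothetical helpers, and the approach relies on the fact that in UTF-8 an ASCII byte never occurs inside a multi-byte sequence, so an ASCII-only literal can be located by a plain octet search.

```perl
use strict;
use warnings;
use Encode qw(encode);

# True when every character is ASCII, i.e. the string is stored as the same
# octets whether or not the internal UTF8 flag is on.
sub is_ascii_only {
    my ($string) = @_;
    return $string !~ /[^\x00-\x7F]/ ? 1 : 0;
}

# Hypothetical helper: look for an ASCII-only $needle in the UTF-8 octets of
# $haystack. Safe only for sub-case (a); other needles are rejected.
sub octet_contains {
    my ($haystack, $needle) = @_;
    die "needle must be ASCII-only\n" unless is_ascii_only($needle);
    my $octets = encode('UTF-8', $haystack);
    return index($octets, $needle) >= 0 ? 1 : 0;
}

print octet_contains("caf\x{e9} ------", "------"), "\n";
print octet_contains("\x{263A} -----",   "------"), "\n";
```

A literal containing characters above U+007F falls into the (b) sub-cases and would need a different comparison, which is where the extra complexity discussed above comes in.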
From victor@vsespb.ru
I tested and it seems the issue is fixed in 5.20. Should it be backported to 5.18?
On Wed Jun 24 15:17:08 2009, perl@nevcal.com wrote:
|
From @iabyn
On Mon, Oct 20, 2014 at 06:21:47AM -0700, Victor Efimov via RT wrote:
It's likely it was fixed by my heavy re-working of the re_intuit_start() -- |
From @dcollinsn
On Mon Oct 20 08:16:08 2014, davem wrote:
Confirmed fixed, closing. -- |
@dcollinsn - Status changed from 'open' to 'resolved' |
Migrated from rt.perl.org#66852 (status was 'resolved')
Searchable as RT66852$