Possible regexp memory explosion in 5.20.0 #13984

p5pRT · 2014-07-13T14:27:14Z

Migrated from rt.perl.org#122283 (status was 'resolved')

Searchable as RT122283$

p5pRT · 2014-07-13T14:27:15Z

From @hvds

Created by @hvds

I've been experimenting with an attempt to take a SQL grammar expressed
in BNF and convert it (programmatically) into something that can parse
SQL with it as a Regexp::Grammars (v1.035) grammar.

The code below is (60%) cut down from an interim stage in that process;
this reaches about 10MB process size under perl-5.16.3; under perl-5.20.0
it grows to over 1GB. Cutting down the grammar rule by rule does gradually
reduce the memory use, but it remains a high multiple of the memory use
under perl-5.16.3, and I've not yet found any smoking gun; I've included
the full 200-odd lines here rather than risk eliding something important.

Damain and I are looking into it, but he suggested I perlbug it as a
heads-up of a possible problem in 5.20, likely of interest to davem
as potentially relating to regexp engine changes.

zen% ulimit -v # I've set a 1GB process-size limit
1000000
zen% /usr/bin/time /opt/perl-5.16.3/bin/perl ./t0 # top(1) shows peak 10MB VIRT
ok
8.52user 0.01system 0:08.54elapsed 99%CPU (0avgtext+0avgdata 34816maxresident)k
0inputs+0outputs (0major+2331minor)pagefaults 0swaps
zen% /usr/bin/time /opt/perl-5.20.0/bin/perl ./t0
Out of memory!
Command exited with non-zero status 1
41.59user 2.10system 0:43.83elapsed 99%CPU (0avgtext+0avgdata 3641344maxresident)k
0inputs+0outputs (0major+228082minor)pagefaults 0swaps
zen% cat t0
#!/opt/perl-5.20.0/bin/perl
use strict;
use warnings;
use Regexp::Grammars;

my $g = qr{
^ <query_specification> $

<rule: simple_Latin_letter> <simple_Latin_upper_case_letter> | <simple_Latin_lower_case_letter>
<token: simple_Latin_upper_case_letter> A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z
<token: simple_Latin_lower_case_letter> a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z
<token: digit> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<token: double_quote> \"
<token: left_paren> $
<token: right_paren> $
<token: asterisk> \*
<token: plus_sign> \+
<token: comma> \,
<token: minus_sign> \-
<token: period> \.
<token: solidus> \/
<token: less_than_operator> \<
<token: equals_operator> \=
<token: greater_than_operator> \>
<token: question_mark> \?
<token: underscore> _
<rule: regular_identifier> <identifier_body>
<rule: identifier_body> <identifier_start> (?:<underscore> | <identifier_part>)*
<token: identifier_start> \w
<rule: identifier_part> <identifier_start> | <digit>
<rule: unsigned_integer> [0-9]+
<rule: sign> <plus_sign> | <minus_sign>
<rule: introducer> <underscore>
<rule: character_set_specification> <standard_character_repertoire_name> | <implementation_dash_defined_character_repertoire_name> | <user_dash_defined_character_repertoire_name> | <standard_universal_character_form_dash_of_dash_use_name> | <implementation_dash_defined_universal_character_form_dash_of_dash_use_name>
<rule: standard_character_repertoire_name> <character_set_name>
<rule: character_set_name> (?:<schema_name> <period>)? <SQL_language_identifier>
<rule: schema_name> (?:<catalog_name> <period>)? <unqualified_schema_name>
<rule: catalog_name> <identifier>
<rule: identifier> (?:<introducer> <character_set_specification>)? <actual_identifier>
<rule: actual_identifier> <regular_identifier>
<token: nondoublequote_character> [^"]
<rule: doublequote_symbol> <double_quote> <double_quote>
<rule: unqualified_schema_name> <identifier>
<rule: SQL_language_identifier> <SQL_language_identifier_start> (?:<underscore> | <SQL_language_identifier_part>)*
<rule: SQL_language_identifier_start> <simple_Latin_letter>
<rule: SQL_language_identifier_part> <simple_Latin_letter> | <digit>
<rule: implementation_dash_defined_character_repertoire_name> <character_set_name>
<rule: user_dash_defined_character_repertoire_name> <character_set_name>
<rule: standard_universal_character_form_dash_of_dash_use_name> <character_set_name>
<rule: implementation_dash_defined_universal_character_form_dash_of_dash_use_name> <character_set_name>
<token: not_equals_operator> \<\>
<token: greater_than_or_equals_operator> \>\=
<token: less_than_or_equals_operator> \<\=
<token: concatenation_operator> \|\|
<rule: qualified_local_table_name> MODULE <period> <local_table_name>
<rule: local_table_name> <qualified_identifier>
<rule: qualified_identifier> <identifier>
<rule: column_name> <identifier>
<rule: time_precision> <time_fractional_seconds_precision>
<rule: time_fractional_seconds_precision> <unsigned_integer>
<rule: timestamp_precision> <time_fractional_seconds_precision>
<rule: interval_qualifier> <start_field> TO <end_field> | <single_datetime_field>
<rule: start_field> <non_dash_second_datetime_field> (?:<left_paren> <interval_leading_field_precision> <right_paren>)?
<token: non_dash_second_datetime_field> YEAR | MONTH | DAY | HOUR | MINUTE
<rule: interval_leading_field_precision> <unsigned_integer>
<rule: end_field> <non_dash_second_datetime_field> | SECOND (?:<left_paren> <interval_fractional_seconds_precision> <right_paren>)?
<rule: interval_fractional_seconds_precision> <unsigned_integer>
<rule: single_datetime_field> <non_dash_second_datetime_field> (?:<left_paren> <interval_leading_field_precision> <right_paren>)? | SECOND (?:<left_paren> <interval_leading_field_precision> (?:<comma> <left_paren> <interval_fractional_seconds_precision>)? <right_paren>)?
<rule: qualified_name> (?:<schema_name> <period>)? <qualified_identifier>
<rule: datetime_value_function> <current_date_value_function> | <current_time_value_function> | <current_timestamp_value_function>
<token: current_date_value_function> CURRENT_DATE
<rule: current_time_value_function> CURRENT_TIME (?:<left_paren> <time_precision> <right_paren>)?
<rule: current_timestamp_value_function> CURRENT_TIMESTAMP (?:<left_paren> <timestamp_precision> <right_paren>)?
<rule: table_name> <qualified_name> | <qualified_local_table_name>
<rule: column_name_list> <column_name> (?:<comma> <column_name>)*
<rule: search_condition> <in_predicate>
<rule: boolean_term> <boolean_factor> (?:AND <boolean_factor>)*
<rule: boolean_factor> (?:NOT)? <boolean_test>
<rule: boolean_test> <boolean_primary> (?:IS (?:NOT)? <truth_value>)?
<rule: boolean_primary> <predicate> | <left_paren> <search_condition> <right_paren>
<rule: predicate> <comparison_predicate> | <between_predicate> | <in_predicate> | <like_predicate> | <null_predicate> | <quantified_comparison_predicate> | <exists_predicate> | <match_predicate> | <overlaps_predicate>
<rule: comparison_predicate> <row_value_constructor> <comp_op> <row_value_constructor>
<rule: row_value_constructor> <row_value_constructor_element> | <left_paren> <row_value_constructor_list> <right_paren> | <row_subquery>
<rule: row_value_constructor_element> <value_expression> | <null_specification> | <default_specification>
<rule: value_expression> <numeric_value_expression> | <string_value_expression> | <datetime_value_expression> | <interval_value_expression>
<rule: numeric_value_expression> <term> (?:<plus_sign> <term> | <minus_sign> <term>)*
<rule: term> <factor> (?:<asterisk> <factor> | <solidus> <factor>)*
<rule: factor> <sign>? <numeric_primary>
<rule: numeric_primary> <value_expression_primary> | <numeric_value_function>
<rule: value_expression_primary> <question_mark> | <column_reference>
<rule: column_reference> (?:<qualifier> <period>)? <column_name>
<rule: qualifier> <table_name> | <correlation_name>
<rule: correlation_name> <identifier>
<token: set_quantifier> DISTINCT | ALL
<rule: subquery> <left_paren> <query_expression> <right_paren>
<rule: query_expression> <non_dash_join_query_expression> | <joined_table>
<rule: non_dash_join_query_expression> (?:<non_dash_join_query_term> | <joined_table> UNION (?:ALL)? <corresponding_spec>? <query_term> | <joined_table> EXCEPT (?:ALL)? <corresponding_spec>? <query_term>) (?:UNION (?:ALL)? <corresponding_spec>? <query_term> | EXCEPT (?:ALL)? <corresponding_spec>? <query_term>)*
<rule: non_dash_join_query_term> (?:<non_dash_join_query_primary> | <joined_table> INTERSECT (?:ALL)? <corresponding_spec>? <query_primary>) (?:INTERSECT (?:ALL)? <corresponding_spec>? <query_primary>)*
<rule: non_dash_join_query_primary> <simple_table> | <left_paren> <non_dash_join_query_expression> <right_paren>
<rule: simple_table> <query_specification> | <table_value_constructor> | <explicit_table>
<rule: query_specification> SELECT <set_quantifier>? <select_list> <table_expression>
<rule: select_list> <asterisk> | <select_sublist> (?:<comma> <select_sublist>)*
<rule: select_sublist> <derived_column> | <qualifier> <period> <asterisk>
<rule: derived_column> <value_expression> <as_clause>?
<rule: as_clause> (?:AS)? <column_name>
<rule: table_expression> <from_clause> <where_clause>?
<rule: from_clause> FROM <table_reference> (?:<comma> <table_reference>)*
<rule: table_reference> <table_name>
<rule: table_subquery> <subquery>
<rule: joined_table> <cross_join> | <qualified_join> | <left_paren> <joined_table> <right_paren>
<rule: cross_join> <table_reference> CROSS JOIN <table_reference>
<rule: qualified_join> <table_reference> (?:NATURAL)? <join_type>? JOIN <table_reference> <join_specification>?
<rule: join_type> INNER | <outer_join_type> (?:OUTER)? | UNION
<token: outer_join_type> LEFT | RIGHT | FULL
<rule: join_specification> <join_condition> | <named_columns_join>
<rule: join_condition> ON <search_condition>
<rule: named_columns_join> USING <left_paren> <join_column_list> <right_paren>
<rule: join_column_list> <column_name_list>
<rule: where_clause> WHERE <search_condition>
<rule: collate_clause> COLLATE <collation_name>
<rule: collation_name> <qualified_name>
<rule: table_value_constructor> VALUES <table_value_constructor_list>
<rule: table_value_constructor_list> <row_value_constructor> (?:<comma> <row_value_constructor>)*
<rule: explicit_table> TABLE <table_name>
<rule: query_term> <non_dash_join_query_term> | <joined_table>
<rule: corresponding_spec> CORRESPONDING (?:BY <left_paren> <corresponding_column_list> <right_paren>)?
<rule: corresponding_column_list> <column_name_list>
<rule: query_primary> <non_dash_join_query_primary> | <joined_table>
<rule: numeric_value_function> <position_expression> | <extract_expression> | <length_expression>
<rule: position_expression> POSITION <left_paren> <character_value_expression> IN <character_value_expression> <right_paren>
<rule: character_value_expression> <character_factor> (?:<concatenation_operator> <character_factor>)*
<rule: concatenation> <character_value_expression> <concatenation_operator> <character_factor>
<rule: character_factor> <character_primary> <collate_clause>?
<rule: character_primary> <value_expression_primary> | <string_value_function>
<rule: string_value_function> <character_value_function> | <bit_value_function>
<rule: character_value_function> <character_substring_function> | <fold> | <form_dash_of_dash_use_conversion> | <character_translation> | <trim_function>
<rule: character_substring_function> SUBSTRING <left_paren> <character_value_expression> FROM <start_position> (?:FOR <string_length>)? <right_paren>
<rule: start_position> <numeric_value_expression>
<rule: string_length> <numeric_value_expression>
<rule: fold> (?:UPPER | LOWER) <left_paren> <character_value_expression> <right_paren>
<rule: form_dash_of_dash_use_conversion> CONVERT <left_paren> <character_value_expression> USING <form_dash_of_dash_use_conversion_name> <right_paren>
<rule: form_dash_of_dash_use_conversion_name> <qualified_name>
<rule: character_translation> TRANSLATE <left_paren> <character_value_expression> USING <translation_name> <right_paren>
<rule: translation_name> <qualified_name>
<rule: trim_function> TRIM <left_paren> <trim_operands> <right_paren>
<rule: trim_operands> (?:<trim_specification>? <trim_character>? FROM)? <trim_source>
<token: trim_specification> LEADING | TRAILING | BOTH
<rule: trim_character> <character_value_expression>
<rule: trim_source> <character_value_expression>
<rule: bit_value_function> <bit_substring_function>
<rule: bit_substring_function> SUBSTRING <left_paren> <bit_value_expression> FROM <start_position> (?:FOR <string_length>)? <right_paren>
<rule: bit_value_expression> <bit_factor> (?:<concatenation_operator> <bit_factor>)*
<rule: bit_concatenation> <bit_value_expression> <concatenation_operator> <bit_factor>
<rule: bit_factor> <bit_primary>
<rule: bit_primary> <value_expression_primary> | <string_value_function>
<rule: extract_expression> EXTRACT <left_paren> <extract_field> FROM <extract_source> <right_paren>
<rule: extract_field> <datetime_field> | <time_zone_field>
<rule: datetime_field> <non_dash_second_datetime_field> | SECOND
<token: time_zone_field> TIMEZONE_HOUR | TIMEZONE_MINUTE
<rule: extract_source> <datetime_value_expression> | <interval_value_expression>
<rule: datetime_value_expression> (?:<datetime_term> | <interval_value_expression> <plus_sign> <datetime_term>) (?:<plus_sign> <interval_term> | <minus_sign> <interval_term>)*
<rule: interval_term> (?:<interval_factor> | <term> <asterisk> <interval_factor>) (?:<asterisk> <factor> | <solidus> <factor>)*
<rule: interval_factor> <sign>? <interval_primary>
<rule: interval_primary> <value_expression_primary> <interval_qualifier>?
<rule: interval_term_2> <interval_term>
<rule: interval_value_expression> (?:<interval_term> | <left_paren> <datetime_value_expression> <minus_sign> <datetime_term> <right_paren> <interval_qualifier>) (?:<plus_sign> <interval_term_1> | <minus_sign> <interval_term_1>)*
<rule: interval_value_expression_1> <interval_value_expression>
<rule: interval_term_1> <interval_term>
<rule: datetime_term> <datetime_factor>
<rule: datetime_factor> <datetime_primary> <time_zone>?
<rule: datetime_primary> <value_expression_primary> | <datetime_value_function>
<rule: time_zone> AT <time_zone_specifier>
<rule: time_zone_specifier> LOCAL | TIME ZONE <interval_value_expression>
<rule: length_expression> <char_length_expression> | <octet_length_expression> | <bit_length_expression>
<rule: char_length_expression> (?:CHAR_LENGTH | CHARACTER_LENGTH) <left_paren> <string_value_expression> <right_paren>
<rule: string_value_expression> <character_value_expression> | <bit_value_expression>
<rule: octet_length_expression> OCTET_LENGTH <left_paren> <string_value_expression> <right_paren>
<rule: bit_length_expression> BIT_LENGTH <left_paren> <string_value_expression> <right_paren>
<token: null_specification> NULL
<token: default_specification> DEFAULT
<rule: row_value_constructor_list> <row_value_constructor_element> (?:<comma> <row_value_constructor_element>)*
<rule: row_subquery> <subquery>
<rule: comp_op> <equals_operator> | <not_equals_operator> | <less_than_operator> | <greater_than_operator> | <less_than_or_equals_operator> | <greater_than_or_equals_operator>
<rule: between_predicate> <row_value_constructor> (?:NOT)? BETWEEN <row_value_constructor> AND <row_value_constructor>
<rule: in_predicate> <row_value_constructor> (?:NOT)? IN <in_predicate_value>
<rule: in_predicate_value> <left_paren> <in_value_list> <right_paren>
<rule: in_value_list> <question_mark> (?:<comma> <question_mark>)+
<rule: like_predicate> <match_value> (?:NOT)? LIKE <pattern> (?:ESCAPE <escape_character>)?
<rule: match_value> <character_value_expression>
<rule: pattern> <character_value_expression>
<rule: escape_character> <character_value_expression>
<token: null_predicate> IS (?:NOT)? NULL
<rule: quantified_comparison_predicate> <row_value_constructor> <comp_op> <quantifier> <table_subquery>
<rule: quantifier> <all> | <some>
<token: all> ALL
<token: some> SOME | ANY
<rule: exists_predicate> EXISTS <table_subquery>
<rule: match_predicate> <row_value_constructor> MATCH (?:UNIQUE)? (?:PARTIAL | FULL)? <table_subquery>
<rule: overlaps_predicate> <row_value_constructor_1> OVERLAPS <row_value_constructor_2>
<rule: row_value_constructor_1> <row_value_constructor>
<rule: row_value_constructor_2> <row_value_constructor>
<token: truth_value> TRUE | FALSE | UNKNOWN

}x;
print "ok\n";
zen%

Perl Info


Flags:
    category=core
    severity=medium

Site configuration information for perl 5.20.0:

Configured by hv at Thu Jun 26 22:38:47 BST 2014.

Summary of my perl5 (revision 5 version 20 subversion 0) configuration:
   
  Platform:
    osname=linux, osvers=2.6.32-46-generic, archname=i686-linux
    uname='linux shad 2.6.32-46-generic #108-ubuntu smp thu apr 11 15:55:01 utc 2013 i686 gnulinux '
    config_args='-des -Dprefix=/opt/perl-5.20.0 -Doptimize=-g -O6 -Dusedevel -Uversiononly'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fwrapv -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-g -O6',
    cppflags='-fwrapv -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.4.3', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/i486-linux-gnu/4.4.3/include-fixed /usr/include/i486-linux-gnu /usr/lib /lib/../lib /usr/lib/../lib /lib /usr/lib/i486-linux-gnu /usr/lib64
    libs=-lnsl -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.11.1.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.11.1'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -g -O6 -L/usr/local/lib -fstack-protector'



@INC for perl 5.20.0:
    /opt/perl-5.20.0/lib/site_perl/5.20.0/i686-linux
    /opt/perl-5.20.0/lib/site_perl/5.20.0
    /opt/perl-5.20.0/lib/5.20.0/i686-linux
    /opt/perl-5.20.0/lib/5.20.0
    .


Environment for perl 5.20.0:
    HOME=/home/hv
    LANG=C
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/hv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
    PERL_BADLANG (unset)
    SHELL=/bin/bash

p5pRT · 2014-07-13T16:47:06Z

From DCONWAY@cpan.org

Further to Hugo's report...

I have now had the opportunity to investigate the problem, and have concluded that this has nothing to do with Regexp::Grammars per se, except that R::G is generating the enormous regex that 5.20 is failing to compile.

The attached example (constructed by removing all the R::G syntactic sugar from Hugo's original) does not make use of Regex::Grammars at all, and still leaks endlessly under 5.20...whereas it compiles repidly and without complaint under all of:

5.10.1
5.12.5
5.14.4
5.16.3
5.18.2

I hope this additional information may be of help in tracking down the regression.

Damian

p5pRT · 2014-07-13T16:47:06Z

From DCONWAY@cpan.org

Perl5.20_regex_compilation_problem.pl

p5pRT · 2014-07-13T16:47:06Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2014-07-13T23:34:55Z

From @arc

Damian Conway via RT <perlbug-followup@perl.org> wrote:

I have now had the opportunity to investigate the problem, and have concluded that this has nothing to do with Regexp::Grammars per se, except that R::G is generating the enormous regex that 5.20 is failing to compile.

Yes, it looks that way to me too. Thanks for supplying that reduction.

The file attached cuts down this regex further still, by removing all
the embedded code blocks, and the various DEFINEs whose names begin
"______0_88".

The symptoms I observed seem to be the same, though I also get a
"panic: memory wrap" error (apparently when passing 4 GiB of allocated
memory).

In blead, it looks like the immediate culprit is study_chunk() — it
starts 185 ms after the start of the sizing pass, and I haven't yet
had the patience to let it run long enough to finish.

--
Aaron Crane ** http://aaroncrane.co.uk/

p5pRT · 2014-07-13T23:34:55Z

From @arc

large_rx.pl

p5pRT · 2014-07-14T10:14:24Z

From @iabyn

On Mon, Jul 14, 2014 at 12:34:06AM +0100, Aaron Crane wrote:

Damian Conway via RT <perlbug-followup@perl.org> wrote:

I have now had the opportunity to investigate the problem, and have concluded that this has nothing to do with Regexp::Grammars per se, except that R::G is generating the enormous regex that 5.20 is failing to compile.

Yes, it looks that way to me too. Thanks for supplying that reduction.

The file attached cuts down this regex further still, by removing all
the embedded code blocks, and the various DEFINEs whose names begin
"______0_88".

The symptoms I observed seem to be the same, though I also get a
"panic: memory wrap" error (apparently when passing 4 GiB of allocated
memory).

In blead, it looks like the immediate culprit is study_chunk() — it
starts 185 ms after the start of the sizing pass, and I haven't yet
had the patience to let it run long enough to finish.

It bisects to the following. Yves...?

commit 099ec7d
Author: Yves Orton <demerphq@gmail.com>
Date: Fri Nov 22 01:08:39 2013 +0100

Fix RT #120600: Variable length lookbehind is not variable

Inside of study_chunk() we have to guard against infinite
recursion with recursive subpatterns. The existing logic
sort of worked, but didn't address all cases properly.

qr/
(?<W>a)
(?<BB>
(?=(?&W))(?<=(?&W))
)
(?&BB)
/x;

The pattern in the test would fail when the optimizer
was expanding (&BB). When it recursed, it creates a bitmap
for the recursion it performs, it then jumps back to
the BB node and then eventually does the first (&W) call.
At this point the bit for (&W) would be set in the bitmask.
When the recursion for the (&W) exited (fake exit through
the study frame logic) the bit was not /unset/. When the parser
then entered the (&W) again it was treated as a nested and
potentially infinite length pattern.

The fake-recursion in study-chunk made it little less obvious
what was going on in the debug output.

By reorganizing the code and adding logic to unset the bitmap
when exiting this bug was fixed. Unfortunately this also revealed
another little issue with patterns like this:

qr/x|(?0)/
qr/(x|(?1))/

which forced the creation of a new bitmask for each branch.
Effectively study_chunk treats each branch as an independent
pattern, so when we are expanding (?1) via the 'x' branch
we dont want that to prevent us from detecting the infinite recursion
in the (?1) branch. If you were to think of trips through study_chunk
as paths, and [] as recursive processing you would get something like:

BRANCH 'x' END
BRANCH (?0) [ 'x' END ]
BRANCH (?0) [ (?0) [ 'x' END ] ]
...

When we want something like:

BRANCH 'x' END
BRANCH (?0) [ 'x' END ]
BRANCH (?0) [ (?0) INFINITE_RECURSION ]

So when we deal with a branch we need to make a new recursion bitmask.

--
"Foul and greedy Dwarf - you have eaten the last candle."
-- "Hordes of the Things", BBC Radio.

p5pRT · 2014-07-14T13:07:32Z

From @demerphq

The only useful thing I have to add /right now/ is that I am glad I wrote a
decent commit message. :-)

On 14 July 2014 12:13, Dave Mitchell <davem@iabyn.com> wrote:

On Mon, Jul 14, 2014 at 12:34:06AM +0100, Aaron Crane wrote:

Damian Conway via RT <perlbug-followup@perl.org> wrote:

I have now had the opportunity to investigate the problem, and have
concluded that this has nothing to do with Regexp::Grammars per se, except
that R::G is generating the enormous regex that 5.20 is failing to compile.

Yes, it looks that way to me too. Thanks for supplying that reduction.

The file attached cuts down this regex further still, by removing all
the embedded code blocks, and the various DEFINEs whose names begin
"______0_88".

The symptoms I observed seem to be the same, though I also get a
"panic: memory wrap" error (apparently when passing 4 GiB of allocated
memory).

In blead, it looks like the immediate culprit is study_chunk() — it
starts 185 ms after the start of the sizing pass, and I haven't yet
had the patience to let it run long enough to finish.

It bisects to the following. Yves...?

commit 099ec7d
Author: Yves Orton <demerphq@gmail.com>
Date: Fri Nov 22 01:08:39 2013 +0100
Fix RT \#120600&#8203;: Variable length lookbehind is not variable

Inside of study\_chunk we have to guard against infinite
recursion with recursive subpatterns\. The existing logic
sort of worked\, but didn't address all cases properly\.

  qr/
    $?\<W>a$
    $?\<BB>
      \(?=\(?&W$\)$?\<=\(?&W$\)
    \)
    $?&BB$
  /x;

The pattern in the test would fail when the optimizer
was expanding $&BB$\. When it recursed\, it creates a bitmap
for the recursion it performs\, it then jumps back to
the BB node and then eventually does the first $&W$ call\.
At this point the bit for $&W$ would be set in the bitmask\.
When the recursion for the $&W$ exited $fake exit through
the study frame logic$ the bit was not /unset/\. When the parser
then entered the $&W$ again it was treated as a nested and
potentially infinite length pattern\.

The fake\-recursion in study\-chunk made it little less obvious
what was going on in the debug output\.

By reorganizing the code and adding logic to unset the bitmap
when exiting this bug was fixed\. Unfortunately this also revealed
another little issue with patterns like this&#8203;:

  qr/x|$?0$/
  qr/$x|\(?1$\)/

which forced the creation of a new bitmask for each branch\.
Effectively study\_chunk treats each branch as an independent
pattern\, so when we are expanding $?1$ via the 'x' branch
we dont want that to prevent us from detecting the infinite recursion
in the $?1$ branch\. If you were to think of trips through study\_chunk
as paths\, and \[\] as recursive processing you would get something like&#8203;:

  BRANCH 'x' END
  BRANCH $?0$ \[ 'x' END \]
  BRANCH $?0$ \[ $?0$ \[ 'x' END \] \]
  \.\.\.

When we want something like&#8203;:

  BRANCH 'x' END
  BRANCH $?0$ \[ 'x' END \]
  BRANCH $?0$ \[ $?0$ INFINITE\_RECURSION \]

So when we deal with a branch we need to make a new recursion bitmask\.
--
"Foul and greedy Dwarf - you have eaten the last candle."
-- "Hordes of the Things", BBC Radio.

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2014-07-14T18:16:35Z

From @khwilliamson

On 07/14/2014 04:13 AM, Dave Mitchell wrote:

It bisects to the following

I'm curious as to how you bisected this. When I tried running Aaron's
script on my machine, it quickly ate up all the memory available. What
I was planning to do to bisect it was to add a call to setrlimit() to
perlmain.c to cause it to die when it used up a much smaller amount of
memory, long before my machine starts thrashing. But perhaps you have a
better way that would be educational for me and others to hear about.

p5pRT · 2014-07-14T20:13:06Z

From @iabyn

On Mon, Jul 14, 2014 at 12:15:46PM -0600, Karl Williamson wrote:

On 07/14/2014 04:13 AM, Dave Mitchell wrote:

It bisects to the following

I'm curious as to how you bisected this. When I tried running Aaron's
script on my machine, it quickly ate up all the memory available. What I
was planning to do to bisect it was to add a call to setrlimit() to
perlmain.c to cause it to die when it used up a much smaller amount of
memory, long before my machine starts thrashing. But perhaps you have a
better way that would be educational for me and others to hear about.

I just started a new shell and did

$ ulimit -v 500000

then ran the bisect.

(I had to experiment for a minute or so to find a suitable limit that
ran ok on 5.18.0 and died quickly on 5.2.0.)

--
My get-up-and-go just got up and went.

p5pRT · 2014-07-22T18:29:39Z

From dana@acm.org

A tool I found useful for this is massif from the valgrind suite, e.g.:

valgrind --tool=massif --stacks=yes --alloc-fn=Perl_safesysmalloc --alloc-fn=Perl_safesyscalloc --alloc-fn=Perl_safesysrealloc --alloc-fn=Perl_Slab_Alloc --time-unit=B --max-snapshots=1000 perl -MMath::Prime::XS=:all -E 'say 1'

(in this case, load up a module and do basically nothing else), then:

massif-visualizer massif.out.#### [#### depending on the file]

to see the graphical results.

This shows, for example, a memory spike from Perl__invlist_union_maybe_complement_2nd that shows up in 5.20.0 and 5.21.2 that is not in 5.19.7 when processing constant.pm's:

my $normal_constant_name = qr/^_?[^\W_0-9]\w*\z/;

which means it hits lots of modules. Tracking down the Perl source that causes given memory behavior is not very straightforward, but I think the tool is pretty valuable for seeing how memory is being used, and what is causing the use, over time.

p5pRT · 2014-07-30T17:35:33Z

From @khwilliamson

On 07/22/2014 12:29 PM, Dana Jacobsen via RT wrote:

A tool I found useful for this is massif from the valgrind suite, e.g.:

valgrind --tool=massif --stacks=yes --alloc-fn=Perl_safesysmalloc --alloc-fn=Perl_safesyscalloc --alloc-fn=Perl_safesysrealloc --alloc-fn=Perl_Slab_Alloc --time-unit=B --max-snapshots=1000 perl -MMath::Prime::XS=:all -E 'say 1'

(in this case, load up a module and do basically nothing else), then:

massif-visualizer massif.out.#### [#### depending on the file]

to see the graphical results.

This shows, for example, a memory spike from Perl__invlist_union_maybe_complement_2nd that shows up in 5.20.0 and 5.21.2 that is not in 5.19.7 when processing constant.pm's:

my $normal_constant_name = qr/^_?[^\W_0-9]\w*\z/;

which means it hits lots of modules. Tracking down the Perl source that causes given memory behavior is not very straightforward, but I think the tool is pretty valuable for seeing how memory is being used, and what is causing the use, over time.

---
via perlbug: queue: perl5 status: open
https://rt-archive.perl.org/perl5/Ticket/Display.html?id=122283

The memory spike occurs when taking the union of two lists. At the
beginning, it allocates enough memory for the worst case scenario, in
which the lists are completely disjoint, so the memory required is the
sum of the memory required by each list. At the end, the resultant list
is trimmed to the actual amount used.

This is done to avoid having to ask for extra memory in the middle of
the operation and potentially have to do extra copies.

p5pRT · 2014-07-31T08:10:54Z

From @demerphq

On 14 July 2014 12:13, Dave Mitchell <davem@iabyn.com> wrote:

On Mon, Jul 14, 2014 at 12:34:06AM +0100, Aaron Crane wrote:

Damian Conway via RT <perlbug-followup@perl.org> wrote:

I have now had the opportunity to investigate the problem, and have
concluded that this has nothing to do with Regexp::Grammars per se, except
that R::G is generating the enormous regex that 5.20 is failing to compile.

Yes, it looks that way to me too. Thanks for supplying that reduction.

The file attached cuts down this regex further still, by removing all
the embedded code blocks, and the various DEFINEs whose names begin
"______0_88".

The symptoms I observed seem to be the same, though I also get a
"panic: memory wrap" error (apparently when passing 4 GiB of allocated
memory).

In blead, it looks like the immediate culprit is study_chunk() — it
starts 185 ms after the start of the sizing pass, and I haven't yet
had the patience to let it run long enough to finish.

It bisects to the following. Yves...?

commit 099ec7d
Author: Yves Orton <demerphq@gmail.com>
Date: Fri Nov 22 01:08:39 2013 +0100
Fix RT \#120600&#8203;: Variable length lookbehind is not variable

Inside of study\_chunk we have to guard against infinite
recursion with recursive subpatterns\. The existing logic
sort of worked\, but didn't address all cases properly\.

  qr/
    $?\<W>a$
    $?\<BB>
      \(?=\(?&W$\)$?\<=\(?&W$\)
    \)
    $?&BB$
  /x;

The pattern in the test would fail when the optimizer
was expanding $&BB$\. When it recursed\, it creates a bitmap
for the recursion it performs\, it then jumps back to
the BB node and then eventually does the first $&W$ call\.
At this point the bit for $&W$ would be set in the bitmask\.
When the recursion for the $&W$ exited $fake exit through
the study frame logic$ the bit was not /unset/\. When the parser
then entered the $&W$ again it was treated as a nested and
potentially infinite length pattern\.

The fake\-recursion in study\-chunk made it little less obvious
what was going on in the debug output\.

By reorganizing the code and adding logic to unset the bitmap
when exiting this bug was fixed\. Unfortunately this also revealed
another little issue with patterns like this&#8203;:

  qr/x|$?0$/
  qr/$x|\(?1$\)/

which forced the creation of a new bitmask for each branch\.
Effectively study\_chunk treats each branch as an independent
pattern\, so when we are expanding $?1$ via the 'x' branch
we dont want that to prevent us from detecting the infinite recursion
in the $?1$ branch\. If you were to think of trips through study\_chunk
as paths\, and \[\] as recursive processing you would get something like&#8203;:

  BRANCH 'x' END
  BRANCH $?0$ \[ 'x' END \]
  BRANCH $?0$ \[ $?0$ \[ 'x' END \] \]
  \.\.\.

When we want something like&#8203;:

  BRANCH 'x' END
  BRANCH $?0$ \[ 'x' END \]
  BRANCH $?0$ \[ $?0$ INFINITE\_RECURSION \]

So when we deal with a branch we need to make a new recursion bitmask\.

I will try to find some time for this.

Yves

p5pRT · 2014-08-26T15:34:09Z

From @demerphq

On 14 July 2014 12:13, Dave Mitchell <davem@iabyn.com> wrote:

On Mon, Jul 14, 2014 at 12:34:06AM +0100, Aaron Crane wrote:

Damian Conway via RT <perlbug-followup@perl.org> wrote:

I have now had the opportunity to investigate the problem, and have
concluded that this has nothing to do with Regexp::Grammars per se, except
that R::G is generating the enormous regex that 5.20 is failing to compile.

Yes, it looks that way to me too. Thanks for supplying that reduction.

The file attached cuts down this regex further still, by removing all
the embedded code blocks, and the various DEFINEs whose names begin
"______0_88".

The symptoms I observed seem to be the same, though I also get a
"panic: memory wrap" error (apparently when passing 4 GiB of allocated
memory).

In blead, it looks like the immediate culprit is study_chunk() — it
starts 185 ms after the start of the sizing pass, and I haven't yet
had the patience to let it run long enough to finish.

It bisects to the following. Yves...?

commit 099ec7d
Author: Yves Orton <demerphq@gmail.com>
Date: Fri Nov 22 01:08:39 2013 +0100
Fix RT \#120600&#8203;: Variable length lookbehind is not variable

Inside of study\_chunk we have to guard against infinite
recursion with recursive subpatterns\. The existing logic
sort of worked\, but didn't address all cases properly\.

  qr/
    $?\<W>a$
    $?\<BB>
      \(?=\(?&W$\)$?\<=\(?&W$\)
    \)
    $?&BB$
  /x;

The pattern in the test would fail when the optimizer
was expanding $&BB$\. When it recursed\, it creates a bitmap
for the recursion it performs\, it then jumps back to
the BB node and then eventually does the first $&W$ call\.
At this point the bit for $&W$ would be set in the bitmask\.
When the recursion for the $&W$ exited $fake exit through
the study frame logic$ the bit was not /unset/\. When the parser
then entered the $&W$ again it was treated as a nested and
potentially infinite length pattern\.

The fake\-recursion in study\-chunk made it little less obvious
what was going on in the debug output\.

By reorganizing the code and adding logic to unset the bitmap
when exiting this bug was fixed\. Unfortunately this also revealed
another little issue with patterns like this&#8203;:

  qr/x|$?0$/
  qr/$x|\(?1$\)/

which forced the creation of a new bitmask for each branch\.
Effectively study\_chunk treats each branch as an independent
pattern\, so when we are expanding $?1$ via the 'x' branch
we dont want that to prevent us from detecting the infinite recursion
in the $?1$ branch\. If you were to think of trips through study\_chunk
as paths\, and \[\] as recursive processing you would get something like&#8203;:

  BRANCH 'x' END
  BRANCH $?0$ \[ 'x' END \]
  BRANCH $?0$ \[ $?0$ \[ 'x' END \] \]
  \.\.\.

When we want something like&#8203;:

  BRANCH 'x' END
  BRANCH $?0$ \[ 'x' END \]
  BRANCH $?0$ \[ $?0$ INFINITE\_RECURSION \]

So when we deal with a branch we need to make a new recursion bitmask\.

Some basic details on this issue:

In order to detect infinite recursion, and to expand certain constructs to
make match more efficient, we make study_chunk walk every possible pathway
through the graph formed by the grammar.

So for instance if we have node C which uses D which uses E, and we can
reach C from both A and B then we will walk the full C-D-E twice.

One could probably argue this is wrong and that we should somehow store
that C-D-E is "safe" when first hit it, and then avoid walking it a second
time.

This is coupled with the naive bitmask strategy for marking which nodes we
have seen.

More to come later.

Yves

p5pRT · 2014-09-25T07:42:31Z

From @demerphq

On 13 July 2014 16:27, Hugo van der Sanden <perlbug-followup@perl.org>
wrote:

# New Ticket Created by Hugo van der Sanden
# Please include the string: [perl #122283]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=122283 >

This is a bug report for perl from hv@crypt.org,
generated with the help of perlbug 1.40 running under perl 5.20.0.

-----------------------------------------------------------------
[Please describe your issue here]

I've been experimenting with an attempt to take a SQL grammar expressed
in BNF and convert it (programmatically) into something that can parse
SQL with it as a Regexp::Grammars (v1.035) grammar.

The code below is (60%) cut down from an interim stage in that process;
this reaches about 10MB process size under perl-5.16.3; under perl-5.20.0
it grows to over 1GB. Cutting down the grammar rule by rule does gradually
reduce the memory use, but it remains a high multiple of the memory use
under perl-5.16.3, and I've not yet found any smoking gun; I've included
the full 200-odd lines here rather than risk eliding something important.

Damain and I are looking into it, but he suggested I perlbug it as a
heads-up of a possible problem in 5.20, likely of interest to davem
as potentially relating to regexp engine changes.

zen% ulimit -v # I've set a 1GB process-size limit
1000000
zen% /usr/bin/time /opt/perl-5.16.3/bin/perl ./t0 # top(1) shows peak 10MB
VIRT
ok
8.52user 0.01system 0:08.54elapsed 99%CPU (0avgtext+0avgdata
34816maxresident)k
0inputs+0outputs (0major+2331minor)pagefaults 0swaps
zen% /usr/bin/time /opt/perl-5.20.0/bin/perl ./t0
Out of memory!
Command exited with non-zero status 1
41.59user 2.10system 0:43.83elapsed 99%CPU (0avgtext+0avgdata
3641344maxresident)k
0inputs+0outputs (0major+228082minor)pagefaults 0swaps
zen% cat t0
#!/opt/perl-5.20.0/bin/perl
use strict;
use warnings;
use Regexp::Grammars;

my $g = qr{
^ <query_specification> $

<rule: simple_Latin_letter> <simple_Latin_upper_case_letter> |
<simple_Latin_lower_case_letter>
<token: simple_Latin_upper_case_letter> A | B | C | D | E | F | G | H | I
| J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z
<token: simple_Latin_lower_case_letter> a | b | c | d | e | f | g | h | i
| j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z
<token: digit> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

You really shoud use character classes here, and not use regex subs for
insertable literals. IOW, (?&digit) should be replaced with $digit which
would be defined as:

$digit= "[0-9]"

Similar for (?&ws) and similar patterns.

Anyway, I have pushed the following commit which should fix this. Please
test.

commit a51d618
Author: Yves Orton <demerphq@gmail.com>
Date: Fri Sep 19 19:57:34 2014 +0200

rt 122283 - do not recurse into GOSUB/GOSTART when not SCF_DO_SUBSTR

See also comments in patch. A complex regex "grammar" like that in
RT 122283 causes perl to take literally forever, and exhaust all
memory during the pattern optimization phase.

Unfortunately I could not track down exacty why this occured, but
it was very clear that the excessive recursion was unnecessary and
excessive. By simply eliminating the unncessary recursion performance
goes back to being acceptable.

I have not thought of a good way to test this change, so this patch
does not include any tests. Perhaps we can test it using alarm, but
I will follow up on that later.

Ticket closers: please dont close the ticket until I have reported that I
have applied tests for this.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2014-09-25T08:47:06Z

From @hvds

demerphq <demerphq@gmail.com> wrote:
:Anyway, I have pushed the following commit which should fix this. Please
:test.
:
:commit a51d618
:Author: Yves Orton <demerphq@gmail.com>
:Date: Fri Sep 19 19:57:34 2014 +0200
:
: rt 122283 - do not recurse into GOSUB/GOSTART when not SCF_DO_SUBSTR

Thanks Yves, I'll try this out over the weekend.

Hugo

p5pRT · 2014-10-05T01:55:02Z

From @cpansprout

On Thu Sep 25 00:42:31 2014, demerphq wrote:

Anyway, I have pushed the following commit which should fix this. Please
test.

commit a51d618
Author: Yves Orton <demerphq@gmail.com>
Date: Fri Sep 19 19:57:34 2014 +0200

Ticket closers: please dont close the ticket until I have reported that I
have applied tests for this.

You applied tests as d9a72fc. Does that mean this can be closed?

(a51d618 is the cause of bug #122890, but I don’t think this needs to stay open because of that.)

--

Father Chrysostomos

p5pRT · 2014-10-05T08:36:56Z

From @demerphq

On 5 October 2014 03:55, Father Chrysostomos via RT <
perlbug-followup@perl.org> wrote:

On Thu Sep 25 00:42:31 2014, demerphq wrote:

Anyway, I have pushed the following commit which should fix this. Please
test.

commit a51d618
Author: Yves Orton <demerphq@gmail.com>
Date: Fri Sep 19 19:57:34 2014 +0200

Ticket closers: please dont close the ticket until I have reported that I
have applied tests for this.

You applied tests as d9a72fc. Does that mean this can be closed?

(a51d618 is the cause of bug #122890, but I don’t think this needs to stay
open because of that.

It is up to you. I would leave it open, but if you think its better to
close it then I am happy with that.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2014-10-22T04:35:38Z

@cpansprout - Status changed from 'open' to 'resolved'

p5pRT closed this as completed Oct 22, 2014

p5pRT added Severity Medium distro-Linux type-core labels Oct 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible regexp memory explosion in 5.20.0 #13984

Possible regexp memory explosion in 5.20.0 #13984

p5pRT commented Jul 13, 2014

p5pRT commented Jul 13, 2014

p5pRT commented Jul 13, 2014

p5pRT commented Jul 13, 2014

p5pRT commented Jul 13, 2014

p5pRT commented Jul 13, 2014

p5pRT commented Jul 13, 2014

p5pRT commented Jul 14, 2014

p5pRT commented Jul 14, 2014

p5pRT commented Jul 14, 2014

p5pRT commented Jul 14, 2014

p5pRT commented Jul 22, 2014

p5pRT commented Jul 30, 2014

p5pRT commented Jul 31, 2014

p5pRT commented Aug 26, 2014

p5pRT commented Sep 25, 2014

p5pRT commented Sep 25, 2014

p5pRT commented Oct 5, 2014

p5pRT commented Oct 5, 2014

p5pRT commented Oct 22, 2014

Possible regexp memory explosion in 5.20.0 #13984

Possible regexp memory explosion in 5.20.0 #13984

Comments

p5pRT commented Jul 13, 2014

p5pRT commented Jul 13, 2014

From @hvds

Created by @hvds

p5pRT commented Jul 13, 2014

From DCONWAY@cpan.org

p5pRT commented Jul 13, 2014

From DCONWAY@cpan.org

p5pRT commented Jul 13, 2014

p5pRT commented Jul 13, 2014

From @arc

p5pRT commented Jul 13, 2014

From @arc

p5pRT commented Jul 14, 2014

From @iabyn

p5pRT commented Jul 14, 2014

From @demerphq

p5pRT commented Jul 14, 2014

From @khwilliamson

p5pRT commented Jul 14, 2014

From @iabyn

p5pRT commented Jul 22, 2014

From dana@acm.org

p5pRT commented Jul 30, 2014

From @khwilliamson

p5pRT commented Jul 31, 2014

From @demerphq

p5pRT commented Aug 26, 2014

From @demerphq

p5pRT commented Sep 25, 2014

From @demerphq

p5pRT commented Sep 25, 2014

From @hvds

p5pRT commented Oct 5, 2014

From @cpansprout

p5pRT commented Oct 5, 2014

From @demerphq

p5pRT commented Oct 22, 2014