doc/Language/regexes.pod6

=begin pod :tag<perl6>

=TITLE Regexes

=SUBTITLE Pattern matching against strings

X<|Regular Expressions>

Regular expressions, I<regexes> for short, are a sequence of characters that
describe a pattern of text. Pattern matching is the process of matching
those patterns to actual text.

=head1 X<Lexical conventions|quote,/ /;quote,rx;quote,m>

Perl 6 has special syntax for writing regexes:

    m/abc/;         # a regex that is immediately matched against $_
    rx/abc/;        # a Regex object; allow adverbs to be used before regex
    /abc/;          # a Regex object; shorthand version of 'rx/ /' operator

For the first two examples, delimiters other than the slash can be used:

    m{abc};
    rx{abc};

Note that neither the colon nor round parentheses can be delimiters; the colon
is forbidden because it clashes with adverbs, such as C<rx:i/abc/>
(case insensitive regexes), and round parentheses indicate a function call
instead.

Example of difference between C<m/ /> and C</ /> operators:

    my $match;
    $_ = "abc";
    $match = m/.+/; say $match; say $match.^name; # OUTPUT: «｢abc｣␤Match␤»
    $match =  /.+/; say $match; say $match.^name; # OUTPUT: «/.+/␤Regex␤»

Whitespace in regexes is generally ignored (except with the C<:s> or,
completely, C<:sigspace> adverb).

Comments work within a regular expression:

    / word #`(match lexical "word") /

=head1 Literals

The simplest case for a regex is a match against a string literal:

    if 'properly' ~~ / perl / {
        say "'properly' contains 'perl'";
    }

Alphanumeric characters and the underscore C<_> are matched literally. All
other characters must either be escaped with a backslash (for example, C<\:>
to match a colon), or be within quotes:

    / 'two words' /;     # matches 'two words' including the blank
    / "a:b"       /;     # matches 'a:b' including the colon
    / '#' /;             # matches a hash character

Strings are searched left to right, so it is enough if only part of the string
matches the regex:

    if 'Life, the Universe and Everything' ~~ / and / {
        say ~$/;           # OUTPUT: «and␤»
        say $/.prematch;   # OUTPUT: «Life, the Universe ␤»
        say $/.postmatch;  # OUTPUT: « Everything␤»
        say $/.from;        # OUTPUT: «19␤»
        say $/.to;          # OUTPUT: «22␤»
    };

Match results are stored in the C<$/> variable and are also returned from
the match. The result is of type L<Match|/type/Match> if the match was
successful; otherwise it is L<Nil|/type/Nil>.


=head1 X<Wildcards|regex, .>

An unescaped dot C<.> in a regex matches any single character.

So, these all match:

    'perl' ~~ /per./;       # matches the whole string
    'perl' ~~ / per . /;    # the same; whitespace is ignored
    'perl' ~~ / pe.l /;     # the . matches the r
    'speller' ~~ / pe.l/;   # the . matches the first l

This doesn't match:

    'perl' ~~ /. per /;

because there's no character to match before C<per> in the target string.

Note that C<.> now does match B<any> single character, that is, it matches
C<\n>. So the text below match:

    my $text = qq:to/END/
      Although I am a
      multi-line text,
      now can be matched
      with /.*/.
      END
      ;

    say $text ~~ / .* /;
    # OUTPUT «｢Although I am a␤multi-line text,␤now can be matched␤with /.*/␤｣»

=head1 Character classes

=head2 Backslashed character classes

There are predefined character classes of the form C<\w>. Its negation is
written with an upper-case letter, C<\W>.

=item X<C<\n> and C<\N>|regex,\n;regex,\N>

C<\n> matches a single, logical newline character.C<\N> matches a single
character that's not a logical newline.

What is considered as a single newline character is defined via the compile
time variable L«C<$?NL>|/language/variables#index-entry-$?NL», and the
L<newline pragma|/language/pragmas>.

Therefore, C<\n> is supposed to be able to match either a Unix-like
newline C<"\n">, a Microsoft Windows style one C<"\r\n">, or one in the Mac style C<"\r">.

=item X<C<\t> and C<\T>|regex,\t;regex,\T>

C<\t> matches a single tab/tabulation character, C<U+0009>. C<\T> matches a
single character that is not a tab.

Note that exotic tabs like the C<U+000B VERTICAL TABULATION> character are not
included here.

=item X<C<\h> and C<\H>|regex,\h;regex,\H>

C<\h> matches a single horizontal whitespace character. C<\H> matches a
single character that is not a horizontal whitespace character.

Examples for horizontal whitespace characters are

    =begin code :lang<text>
    U+0020 SPACE
    U+00A0 NO-BREAK SPACE
    U+0009 CHARACTER TABULATION
    U+2001 EM QUAD
    =end code

Vertical whitespace like newline characters are explicitly excluded; those
can be matched with C<\v>, and C<\s> matches any kind of whitespace.

=item X<C<\v> and C<\V>|regex,\v;regex,\V>

C<\v> matches a single vertical whitespace character. C<\V> matches a single
character that is not vertical whitespace.

Examples for vertical whitespace characters:

    =begin code :lang<text>
    U+000A LINE FEED
    U+000B VERTICAL TABULATION
    U+000C FORM FEED
    U+000D CARRIAGE RETURN
    U+0085 NEXT LINE
    U+2028 LINE SEPARATOR
    U+2029 PARAGRAPH SEPARATOR
    =end code

Use C<\s> to match any kind of whitespace, not just vertical whitespace.

=item X<C<\s> and C<\S>|regex,\s;regex,\S>

C<\s> matches a single whitespace character. C<\S> matches a single
character that is not whitespace.

    say $/.prematch if 'Match the first word.' ~~ / \s+ /;
    # OUTPUT: «Match␤»

=item X<C<\d> and C<\D>|regex,\d;regex,\D>

C<\d> matches a single digit (Unicode property C<N>) and C<\D> matches a
single character that is not a digit.

    'ab42' ~~ /\d/ and say ~$/;     # OUTPUT: «4␤»
    'ab42' ~~ /\D/ and say ~$/;     # OUTPUT: «a␤»

Note that not only the Arabic digits (commonly used in the Latin
alphabet) match C<\d>, but also digits from other scripts.

Examples for digits are:

    =begin code :lang<text>
    U+0035 5 DIGIT FIVE
    U+0BEB ௫ TAMIL DIGIT FIVE
    U+0E53 ๓ THAI DIGIT THREE
    U+17E5 ៥ KHMER DIGIT FIVE
    =end code

=item X<C<\w> and C<\W>|regex,\w;regex,\W>

C<\w> matches a single word character, i.e. a letter (Unicode category
L), a digit or an underscore. C<\W> matches a single character that is
not a word character.

Examples of word characters:

    =begin code :lang<text>
    0041 A LATIN CAPITAL LETTER A
    0031 1 DIGIT ONE
    03B4 δ GREEK SMALL LETTER DELTA
    03F3 ϳ GREEK LETTER YOT
    0409 Љ CYRILLIC CAPITAL LETTER LJE
    =end code

=head2 Predefined character classes

=begin table

    Class    | Shorthand | Description
    =========+===========+=============
    <alpha>  |           | Alphabetic characters including _
    <digit>  | \d        | Decimal digits
    <xdigit> |           | Hexadecimal digit [0-9A-Fa-f]
    <alnum>  | \w        | <alpha> plus <digit>
    <punct>  |           | Punctuation and Symbols (only Punct beyond ASCII)
    <graph>  |           | <alnum> plus <punct>
    <space>  | \s        | Whitespace
    <cntrl>  |           | Control characters
    <print>  |           | <graph> plus <space>, but no <cntrl>
    <blank>  | \h        | Horizontal whitespace
    <lower>  | <:Ll>     | Lowercase characters
    <upper>  | <:Lu>     | Uppercase characters
    <same>   |           | Matches between two identical characters
    <wb>     |           | Word Boundary
    <ws>     |           | Whitespace. This is actually a default rule.
    <ww>     |           | Within Word
    <ident>  |           | Identifier. Also a default rule.

=end table

Note that the character classes C«<same>», C«<wb>» and C«<ww>» are
so called zero-width assertions, which do not really match a
character.

=head2 X«Unicode properties|regex,<:property>»

The character classes mentioned so far are mostly for convenience; another
approach is to use Unicode character properties. These come in the form
C«<:property>», where C<property> can be a short or long Unicode General
Category name. These use pair syntax.

To match against a Unicode property you can use either smartmatch or
L<C<uniprop>|/routine/uniprop>:

    "a".uniprop('Script');                 # OUTPUT: «Latin␤»
    "a" ~~ / <:Script<Latin>> /;           # OUTPUT: «｢a｣␤»
    "a".uniprop('Block');                  # OUTPUT: «Basic Latin␤»
    "a" ~~ / <:Block('Basic Latin')> /;    # OUTPUT: «｢a｣␤»

These are the unicode general categories used for matching:

=begin table

    Short   |    Long
    ========+========
    L       |    Letter
    LC      |    Cased_Letter
    Lu      |    Uppercase_Letter
    Ll      |    Lowercase_Letter
    Lt      |    Titlecase_Letter
    Lm      |    Modifier_Letter
    Lo      |    Other_Letter
    M       |    Mark
    Mn      |    Nonspacing_Mark
    Mc      |    Spacing_Mark
    Me      |    Enclosing_Mark
    N       |    Number
    Nd      |    Decimal_Number or digit
    Nl      |    Letter_Number
    No      |    Other_Number
    P       |    Punctuation or punct
    Pc      |    Connector_Punctuation
    Pd      |    Dash_Punctuation
    Ps      |    Open_Punctuation
    Pe      |    Close_Punctuation
    Pi      |    Initial_Punctuation
    Pf      |    Final_Punctuation
    Po      |    Other_Punctuation
    S       |    Symbol
    Sm      |    Math_Symbol
    Sc      |    Currency_Symbol
    Sk      |    Modifier_Symbol
    So      |    Other_Symbol
    Z       |    Separator
    Zs      |    Space_Separator
    Zl      |    Line_Separator
    Zp      |    Paragraph_Separator
    C       |    Other
    Cc      |    Control or cntrl
    Cf      |    Format
    Cs      |    Surrogate
    Co      |    Private_Use
    Cn      |    Unassigned

=end table

For example, C«<:Lu>» matches a single, upper-case letter.

Its negation is this: C«<:!property>». So, C«<:!Lu>» matches a single
character that is not an upper-case letter.

Categories can be used together, with an infix operator:

=begin table
    Operator  |  Meaning
    ==========+=========
    +         |  set union
    -         |  set difference
=end table

To match either a lower-case letter or a number, write
C«<:Ll+:N>» or C«<:Ll+:Number>» or C«<+ :Lowercase_Letter + :Number>».

It's also possible to group categories and sets of categories with
parentheses; for example:

    say $0 if 'perl6' ~~ /\w+(<:Ll+:N>)/ # OUTPUT: «｢6｣␤»

=head2 X«Enumerated character classes and ranges|regex,<[ ]>;regex,<-[ ]>»

Sometimes the pre-existing wildcards and character classes are not
enough. Fortunately, defining your own is fairly simple. Within C«<[ ]>»,
you can put any number of single characters and ranges of characters
(expressed with two dots between the end points), with or without
whitespace.

    "abacabadabacaba" ~~ / <[ a .. c 1 2 3 ]>* /;
    # Unicode hex codepoint range
    "ÀÁÂÃÄÅÆ" ~~ / <[ \x[00C0] .. \x[00C6] ]>* /;
    # Unicode named codepoint range
    "αβγ" ~~ /<[\c[GREEK SMALL LETTER ALPHA]..\c[GREEK SMALL LETTER GAMMA]]>*/;

Within the C«< >» you can use C<+> and C<-> to add or remove multiple range
definitions and even mix in some of the unicode categories above. You can also
write the backslashed forms for character classes between the C<[ ]>.

    / <[\d] - [13579]> /;
    # starts with \d and removes odd ASCII digits, but not quite the same as
    / <[02468]> /;
    # because the first one also contains "weird" unicodey digits

You can include Unicode properties in the list as well:

    /<:Zs + [\x9] - [\xA0]>/
    # Any character with "Zs" property, or a tab, but not a newline

You can use C<\> to escape characters that would have some meaning in the
regular expression:

    say "[ hey ]" ~~ /<-[ \] \[ \s ]>+/; # OUTPUT: «｢hey｣␤»

To negate a character class, put a C<-> after the opening angle:

    say 'no quotes' ~~ /  <-[ " ]> + /;  # matches characters except "

A common pattern for parsing quote-delimited strings involves negated
character classes:

    say '"in quotes"' ~~ / '"' <-[ " ]> * '"'/;

This first matches a quote, then any characters that aren't quotes, and then
a quote again. The meaning of C<*> and C<+> in the examples above are
explained in the next section, Quantifier.

Just as you can use the C<-> for both set difference and negation of a
single value, you can also explicitly put a C<+> in front:

    / <+[123]> /  # same as <[123]>

=head1 Quantifiers

A quantifier makes the preceding atom match a
variable number of times. For example, C<a+> matches one or more C<a>
characters.

Quantifiers bind tighter than concatenation, so C<ab+> matches one C<a>
followed by one or more C<b>s. This is different for quotes, so C<'ab'+>
matches the strings C<ab>, C<abab>, C<ababab> etc.

=head2 X<One or more: C<+>|regex quantifier,+>

The C<+> quantifier makes the preceding atom match one or more times, with
no upper limit.

For example, to match strings of the form C<key=value>, you can write a regex
like this:

    / \w+ '=' \w+ /

=head2 X<Zero or more: C<*>|regex quantifier,*>

The C<*> quantifier makes the preceding atom match zero or more times, with
no upper limit.

For example, to allow optional whitespace between C<a> and C<b> you can write

    / a \s* b /

=head2 X<Zero or one: C<?>|regex quantifier,?>

The C<?> quantifier makes the preceding atom match zero or once.

From example, to match C<dog> or C<dogs>, you can write:

    / dogs? /

=head2 X<General quantifier: C<** min..max>|regex quantifier,**>

To quantify an atom an arbitrary number of times, use C<**> quantifier.
The quantifier takes a single L<Int> or a L<Range> on the right-hand side
that specifies the number of times to match. If L<Range> is specified,
the end-points specify the minimum and maximum number of times to match.

    =begin code
    say 'abcdefg' ~~ /\w ** 4/;      # OUTPUT: «｢abcd｣␤»
    say 'a'       ~~ /\w **  2..5/;  # OUTPUT: «Nil␤»
    say 'abc'     ~~ /\w **  2..5/;  # OUTPUT: «｢abc｣␤»
    say 'abcdefg' ~~ /\w **  2..5/;  # OUTPUT: «｢abcde｣␤»
    say 'abcdefg' ~~ /\w ** 2^..^5/; # OUTPUT: «｢abcd｣␤»
    say 'abcdefg' ~~ /\w ** ^3/;     # OUTPUT: «｢ab｣␤»
    say 'abcdefg' ~~ /\w ** 1..*/;   # OUTPUT: «｢abcdefg｣␤»
    =end code

Only basic literal syntax for the right-hand side of the quantifier
is supported, to avoid ambiguities with other regex constructs. If you need
to use a more complex expression; for example, a L<Range> made from
variables—enclose the L<Range> into curly braces:

    =begin code
    my $start = 3;
    say 'abcdefg' ~~ /\w ** {$start .. $start+2}/; # OUTPUT: «｢abcde｣␤»
    say 'abcdefg' ~~ /\w ** {π.Int}/;              # OUTPUT: «｢abc｣␤»
    =end code

Negative values are treated like zero:

    =begin code
    say 'abcdefg' ~~ /\w ** {-Inf}/;     # OUTPUT: «｢｣␤»
    say 'abcdefg' ~~ /\w ** {-42}/;      # OUTPUT: «｢｣␤»
    say 'abcdefg' ~~ /\w ** {-10..-42}/; # OUTPUT: «｢｣␤»
    say 'abcdefg' ~~ /\w ** {-42..-10}/; # OUTPUT: «｢｣␤»
    =end code

If then, the resultant value is C<Inf>
or C<NaN> or the resultant L<Range> is empty, non-Numeric, contains C<NaN>
end-points, or has minimum effective end-point as C<Inf>, the
C<X::Syntax::Regex::QuantifierValue> exception will be thrown:

    =begin code
    (try say 'abcdefg' ~~ /\w ** {42..10}/  )
        orelse say ($!.^name, $!.empty-range);
        # OUTPUT: «(X::Syntax::Regex::QuantifierValue True)␤»
    (try say 'abcdefg' ~~ /\w ** {Inf..Inf}/)
        orelse say ($!.^name, $!.inf);
        # OUTPUT: «(X::Syntax::Regex::QuantifierValue True)␤»
    (try say 'abcdefg' ~~ /\w ** {NaN..42}/ )
        orelse say ($!.^name, $!.non-numeric-range);
        # OUTPUT: «(X::Syntax::Regex::QuantifierValue True)␤»
    (try say 'abcdefg' ~~ /\w ** {"a".."c"}/)
        orelse say ($!.^name, $!.non-numeric-range);
        # OUTPUT: «(X::Syntax::Regex::QuantifierValue True)␤»
    (try say 'abcdefg' ~~ /\w ** {Inf}/)
        orelse say ($!.^name, $!.inf);
        # OUTPUT: «(X::Syntax::Regex::QuantifierValue True)␤»
    (try say 'abcdefg' ~~ /\w ** {NaN}/)
        orelse say ($!.^name, $!.non-numeric);
        # OUTPUT: «(X::Syntax::Regex::QuantifierValue True)␤»
    =end code

=head2 X<Modified quantifier: C<%>, C<%%>|regex,%;regex,%%>

To more easily match things like comma separated values, you can tack on a
C<%> modifier to any of the above quantifiers to specify a separator that must
occur between each of the matches. For example, C<a+ % ','> will match
C<a> or C<a,a> or C<a,a,a>, etc. To also match trailing delimiters
( C<a,> or C<a,a,> ), you can use C<%%> instead of C<%>.

The quantifier interacts with C<%> and controls the number of overall
repetitions that can match successfully, so C<a* % ','> also matches the empty
string. If you want match words delimited by commas, you might need to nest an
ordinary and a modified quantifier:

    =begin code
    say so 'abc,def' ~~ / ^ [\w+] ** 1 % ',' $ /;  # Output: «False»
    say so 'abc,def' ~~ / ^ [\w+] ** 2 % ',' $ /;  # Output: «True»
    =end code

=head2 X<Preventing backtracking: C<:>|regex,:>

You can prevent backtracking in regexes by attaching a C<:> modifier
to the quantifier:

=begin code
my $str = "ACG GCT ACT An interesting chain";
say $str ~~ /<[ACGT\s]>+ \s+ (<[A..Z a..z \s]>+)/;
# OUTPUT: «｢ACG GCT ACT An interesting chain｣␤ 0 => ｢An interesting chain｣␤»
say $str ~~ /<[ACGT\s]>+: \s+ (<[A..Z a..z \s]>+)/;
# OUTPUT: «Nil␤»
=end code

In the second case, the C<A> in C<An> had already been "absorbed" by the
pattern, preventing the matching of the second part of the pattern, after
C<\s+>. Generally we will want the opposite: prevent backtracking to match
precisely what we are looking for.

In most cases, you will want to prevent backtracking for efficiency reasons, for
instance here:

=for code :preamble<my $str>
    say $str ~~ m:g/[(<[ACGT]> **: 3) \s*]+ \s+ (<[A..Z a..z \s]>+)/;
    # OUTPUT:
    # (｢ACG GCT ACT An interesting chain｣
    # 0 => ｢ACG｣
    # 0 => ｢GCT｣
    # 0 => ｢ACT｣
    # 1 => ｢An interesting chain｣)

Although in this case, eliminating the C<:> from behind C<**> would make it behave exactly in the same way. The best use is to create I<tokens> that will not be backtracked:

    $_ = "ACG GCT ACT IDAQT";
    say  m:g/[(\w+:) \s*]+ (\w+) $$/;
    # OUTPUT:
    # (｢ACG GCT ACT IDAQT｣
    # 0 => ｢ACG｣
    # 0 => ｢GCT｣
    # 0 => ｢ACT｣
    # 1 => ｢IDAQT｣)

Without the C<:> following C<\w+>, the I<ID> part captured would have been
simply C<T>, since the pattern would go ahead and match everything, leaving a
single letter to match the C<\w+> expression at the end of tne line.

=head2 X<Greedy versus frugal quantifiers: C<?>|regex,?>

By default, quantifiers request a greedy match:

    =begin code
    'abababa' ~~ /a .* a/ && say ~$/;   # OUTPUT: «abababa␤»
    =end code

You can attach a C<?> modifier to the quantifier to enable frugal
matching:

    =begin code
    'abababa' ~~ /a .*? a/ && say ~$/;   # OUTPUT: «aba␤»
    =end code

You can also enable frugal matching for general quantifiers:

    say '/foo/o/bar/' ~~ /\/.**?{1..10}\//;  # OUTPUT: «｢/foo/｣␤»
    say '/foo/o/bar/' ~~ /\/.**!{1..10}\//;  # OUTPUT: «｢/foo/o/bar/｣␤»

Greedy matching can be explicitly requested with the C<!> modifier.

=head1 X<Alternation: C<||>|regex,||>

To match one of several possible alternatives, separate them by C<||>; the
first matching alternative wins.

For example, C<ini> files have the following form:

    =begin code :lang<text>
    [section]
    key = value
    =end code

Hence, if you parse a single line of an C<ini> file, it can be either a
section or a key-value pair and the regex would be (to a first
approximation):

    / '[' \w+ ']' || \S+ \s* '=' \s* \S* /

That is, either a word surrounded by square brackets, or a string of
non-whitespace characters, followed by zero or more spaces, followed by the
equals sign C<=>, followed again by optional whitespace, followed by another
string of non-whitespace characters.

Even in non-backtracking contexts, the alternation operator C<||> tries
all the branches in order until the first one matches.

=head1 X<Longest alternation: C<|>|regex,|>

In short, in regex branches separated by C<|>, the longest token match wins,
independent of the textual ordering in the regex. However, what C<|> really
does is more than that.
It does not decide which branch wins after finishing the whole match,
but follows the L<longest-token matching (LTM) strategy|https://design.perl6.org/S05.html#Longest-token_matching>.

Briefly, what C<|> does is this:

=item1 First, select the branch which has the longest declarative prefix.

    say "abc" ~~ /ab | a.* /;                 # Output: ⌜abc⌟
    say "abc" ~~ /ab | a {} .* /;             # Output: ⌜ab⌟
    say "if else" ~~ / if | if <.ws> else /;  # Output: ｢if｣
    say "if else" ~~ / if | if \s+   else /;  # Output: ｢if else｣

As is shown above, C<a.*> is a declarative prefix, while C<a {} .*> terminates
at C<{}>, then its declarative prefix is C<a>. Note that non-declarative atoms
terminate declarative prefix. This is quite important if you want to apply
C<|> in a C<rule>, which automatically enables C<:s>, and C«<.ws>» accidentally
terminates declarative prefix.

=item1 If it's a tie, select the match with the highest specificity.

    say "abc" ~~ /a. | ab { print "win" } /;  # Output: win｢ab｣

When two alternatives match at the same length, the tie is broken by
specificity. That is, C<ab>, as an exact match, counts as closer than C<a.>,
which uses character classes.

=item1 If it's still a tie, use additional tie-breakers.

    say "abc" ~~ /a\w| a. { print "lose" } /; # Output: ⌜ab⌟

If the tie breaker above doesn't work, then the textually earlier alternative
takes precedence.

For more details, see
L<the LTM strategy|https://design.perl6.org/S05.html#Longest-token_matching>.

=head2 Quoted lists are LTM matches

Using a quoted list in a regex is equivalent to specifying the
longest-match alternation of the list's elements.  So, the following
match:

    say 'food' ~~ /< f fo foo food >/;      # OUTPUT: «｢food｣␤»

is equivalent to:

    say 'food' ~~ / f | fo | foo | food /;  # OUTPUT: «｢food｣␤»

Note that the space after the first C«<» is significant here: C«<food>»
calls the named rule C<food> while C«< food >» and C«< food>» specify
quoted lists with a single element, C<'food'>.

Arrays can also be interpolated into a regex to achieve the same effect:

    my @increasingly-edible = <f fo foo food>;
    say 'food' ~~ /@increasingly-edible/;   # OUTPUT: «｢food｣␤»

This is documented further under L<Regex Interpolation|#Regex_interpolation>,
below.

=head1 X<Conjunction: C<&&>|regex,&&>

Matches successfully if all C<&&>-delimited segments match the same substring
of the target string. The segments are evaluated left to right.

This can be useful for augmenting an existing regex. For example if you have
a regex C<quoted> that matches a quoted string, then C</ <quoted> && <-[x]>* />
matches a quoted string that does not contain the character C<x>.

Note that you cannot easily obtain the same behavior with a lookahead, that
is, a regex doesn't consume characters, because a lookahead doesn't stop
looking when the quoted string stops matching.

    =begin code
    say 'abc' ~~ / <?before a> && . /;    # OUTPUT: «Nil␤»
    say 'abc' ~~ / <?before a> . && . /;  # OUTPUT: «｢a｣␤»
    say 'abc' ~~ / <?before a> . /;       # OUTPUT: «｢a｣␤»
    say 'abc' ~~ / <?before a> .. /;      # OUTPUT: «｢ab｣␤»
    =end code

=head1 X<Conjunction: C<&>|regex,&>

Much like C<&&> in a regex, it matches successfully if all segments
separated by C<&> match the same part of the target string.

C<&> (unlike C<&&>) is considered declarative, and notionally all the
segments can be evaluated in parallel, or in any order the compiler chooses.

=head1 Anchors

Regexes search an entire string for matches. Sometimes this is not what
you want. Anchors match only at certain positions in the string, thereby
anchoring the regex match to that position.

=head2 X<Start of string and end of string|regex,^;regex,$>

The C<^> anchor only matches at the start of the string:

    say so 'properly' ~~ /  perl/;    # OUTPUT: «True␤»
    say so 'properly' ~~ /^ perl/;    # OUTPUT: «False␤»
    say so 'perly'    ~~ /^ perl/;    # OUTPUT: «True␤»
    say so 'perl'     ~~ /^ perl/;    # OUTPUT: «True␤»

The C<$> anchor only matches at the end of the string:

    say so 'use perl' ~~ /  perl  /;   # OUTPUT: «True␤»
    say so 'use perl' ~~ /  perl $/;   # OUTPUT: «True␤»
    say so 'perly'    ~~ /  perl $/;   # OUTPUT: «False␤»

You can combine both anchors:

    say so 'use perl' ~~ /^ perl $/;   # OUTPUT: «False␤»
    say so 'perl'     ~~ /^ perl $/;   # OUTPUT: «True␤»

Keep in mind that C<^> matches the start of a B<string>, not the start of
a B<line>. Likewise, C<$> matches the end of a B<string>, not the end of
a B<line>.

The following is a multi-line string:

    my $str = chomp q:to/EOS/;
       Keep it secret
       and keep it safe
       EOS

    # 'safe' is at the end of the string
    say so $str ~~ /safe   $/;   # OUTPUT: «True␤»

    # 'secret' is at the end of a line, not the string
    say so $str ~~ /secret $/;   # OUTPUT: «False␤»

    # 'Keep' is at the start of the string
    say so $str ~~ /^Keep   /;   # OUTPUT: «True␤»

    # 'and' is at the start of a line -- not the string
    say so $str ~~ /^and    /;   # OUTPUT: «False␤»

=head2 X<Start of line and end of line|regex,^^;regex,$$>

The C<^^> anchor matches at the start of a logical line. That is, either
at the start of the string, or after a newline character. However, it does not
match at the end of the string, even if it ends with a newline character.

The C<$$> anchor matches at the end of a logical line. That is, before a newline
character, or at the end of the string when the last character is not a
newline character.

To understand the following example, it's important to know that the
C<q:to/EOS/...EOS> L<heredoc|/language/quoting#Heredocs:_:to> syntax removes
leading indention to the same level as the C<EOS> marker, so that the first,
second and last lines have no leading space and the third and fourth lines have
two leading spaces each.

    my $str = q:to/EOS/;
        There was a young man of Japan
        Whose limericks never would scan.
          When asked why this was,
          He replied "It's because I always try to fit
        as many syllables into the last line as ever I possibly can."
        EOS

    # 'There' is at the start of string
    say so $str ~~ /^^ There/;        # OUTPUT: «True␤»

    # 'limericks' is not at the start of a line
    say so $str ~~ /^^ limericks/;    # OUTPUT: «False␤»

    # 'as' is at start of the last line
    say so $str ~~ /^^ as/;            # OUTPUT: «True␤»

    # there are blanks between start of line and the "When"
    say so $str ~~ /^^ When/;         # OUTPUT: «False␤»

    # 'Japan' is at end of first line
    say so $str ~~ / Japan $$/;       # OUTPUT: «True␤»

    # there's a . between "scan" and the end of line
    say so $str ~~ / scan $$/;        # OUTPUT: «False␤»

    # matched at the last line
    say so $str ~~ / '."' $$/;        # OUTPUT: «True␤»

=head2 X«Word boundary|regex, <|w>;regex, <!|w>»

To match any word boundary, use C«<|w>» or C«<?wb>». This is similar to
X«C<\b>|regex deprecated,\b» of other languages.

To match not a word boundary, use C«<!|w>» or C«<!wb>». This is similar to
X«C<\B>|regex deprecated,\B» of other languages.

These are both zero-width regex elements.

    say "two-words" ~~ / two<|w>\-<|w>words /;    # OUTPUT: «｢two-words｣␤»
    say "twowords" ~~ / two<!|w><!|w>words /;     # OUTPUT: «｢twowords｣»


X<<<|regex, <<; regex, >>; regex, «; regex, »>>>
=head2 Left and right word boundary

C«<<» matches a left word boundary. It matches positions where there is a
non-word character at the left, or the start of the string, and a word
character to the right.

C«>>» matches a right word boundary. It matches positions where there
is a word character at the left and a non-word character at the right, or
the end of the string.

These are both zero-width regex elements.

    my $str = 'The quick brown fox';
    say so ' ' ~~ /\W/;               # OUTPUT: «True␤»
    say so $str ~~ /br/;              # OUTPUT: «True␤»
    say so $str ~~ /<< br/;           # OUTPUT: «True␤»
    say so $str ~~ /br >>/;           # OUTPUT: «False␤»
    say so $str ~~ /own/;             # OUTPUT: «True␤»
    say so $str ~~ /<< own/;          # OUTPUT: «False␤»
    say so $str ~~ /own >>/;          # OUTPUT: «True␤»
    say so $str ~~ /<< The/;          # OUTPUT: «True␤»
    say so $str ~~ /fox >>/;          # OUTPUT: «True␤»

You can also use the variants C<«> and C<»> :

    my $str = 'The quick brown fox';
    say so $str ~~ /« own/;          # OUTPUT: «False␤»
    say so $str ~~ /own »/;          # OUTPUT: «True␤»

To see the difference between C«<|w>» and C<«>, C<»>:

    say "stuff here!!!".subst(:g, />>/, '|');   # OUTPUT: «stuff| here|!!!␤»
    say "stuff here!!!".subst(:g, /<</, '|');   # OUTPUT: «|stuff |here!!!␤»
    say "stuff here!!!".subst(:g, /<|w>/, '|'); # OUTPUT: «|stuff| |here|!!!␤»

=head2 Summary of anchors

Anchors are zero-width regex elements. Hence they do not use up a character of
the input string, that is, they do not advance the current position at which
the regex engine tries to match. A good mental model is that they match between
two characters of an input string, or before the first, or after the last
character of an input string.

=begin table

    Anchor  | Description         | Examples
    ========+=====================+=================
    ^       | Start of string     | "⏏two\nlines"
    ^^      | Start of line       | "⏏two\n⏏lines"
    $       | End of string       | "two\nlines⏏"
    $$      | End of line         | "two⏏\nlines⏏"
    << or « | Left word boundary  | "⏏two ⏏words"
    >> or » | Right word boundary | "two⏏ words⏏"
    <?wb>   | Any word boundary   | "⏏two⏏ ⏏words⏏~!"
    <!wb>   | Not a word boundary | "t⏏w⏏o w⏏o⏏r⏏d⏏s~⏏!"
    <?ww>   | Within word         | "t⏏w⏏o w⏏o⏏r⏏d⏏s~!"
    <!ww>   | Not within word     | "⏏two⏏ ⏏words⏏~⏏!⏏"

=end table

=head1 Zero-width assertions

Zero-Width Assertions can help you implement your own anchor.

A zero-width assertion turns another regex into an anchor, making
them consume no characters of the input string. There are two variants:
lookahead and lookbehind assertions.

Technically, anchors are also zero-width assertions, and they can look
both ahead and behind.

=head2 X<Lookahead assertions|regex,before>

To check that a pattern appears before another pattern, use a
lookahead assertion via the C<before> assertion. This has the form:

    <?before pattern>

Thus, to search for the string C<foo> which is immediately followed by the
string C<bar>, use the following regexp:

    / foo <?before bar> /

For example:

    say "foobar" ~~ / foo <?before bar> /;  # OUTPUT: «foo␤»

However, if you want to search for a pattern which is B<not> immediately
followed by some pattern, then you need to use a negative lookahead
assertion, this has the form:

    <!before pattern>

In the following example, all occurrences of C<foo> which is not before
C<bar> would match with

    say "foobaz" ~~ / foo <!before bar> /;  # OUTPUT: «foo␤»

Lookahead assertions can be used also with other patterns, like
characters ranges, interpolated variables, subscripts and so on. In such
cases it does suffice to use a C<?>, or a C<!> for the negate form.
For instance, the following lines all produce the very same result:

    say 'abcdefg' ~~ rx{ abc <?before def> };        # OUTPUT: ｢abc｣
    say 'abcdefg' ~~ rx{ abc <?[ d..f ]> };          # OUTPUT: ｢abc｣
    my @ending_letters = <d e f>;
    say 'abcdefg' ~~ rx{ abc <?@ending_letters> };   # OUTPUT: ｢abc｣

A practical use of lookahead assertions is in substitutions, where
you only want to substitute regex matches that are in a certain
context. For example, you might want to substitute only numbers
that are followed by a unit (like I<kg>), but not other numbers:

    my @units = <kg m km mm s h>;
    $_ = "Please buy 2 packs of sugar, 1 kg each";
    s:g[\d+ <?before \s* @units>] = 5 * $/;
    say $_;         # OUTPUT: Please buy 2 packs of sugar, 5 kg each

Since the lookahead is not part of the match object, the unit
is not substituted.

=head2 X<Lookbehind assertions|regex,after>

To check that a pattern appears after another pattern, use a
lookbehind assertion via the C<after> assertion. This has the form:

    <?after pattern>

Therefore, to search for the string C<bar> immediately preceded by the
string C<foo>, use the following regexp:

    / <?after foo> bar /

For example:

    say "foobar" ~~ / <?after foo> bar /;   # OUTPUT: «bar␤»

However, if you want to search for a pattern which is B<not> immediately
preceded by some pattern, then you need to use a negative lookbehind
assertion, this has the form:

    <!after pattern>

Hence all occurrences of C<bar> which do not have C<foo> before them
would be matched by

    say "fotbar" ~~ / <!after foo> bar /;    # OUTPUT: «bar␤»

These are, as in the case of lookahead, zero-width assertions which do not I<consume> characters, like here:

    say "atfoobar" ~~ / (.**3) .**2 <?after foo> bar /;
    # OUTPUT: «｢atfoobar｣␤ 0 => ｢atf｣␤»

where we capture the first 3 of the 5 characters before bar, but only if C<bar>
is preceded by C<foo>. The fact that the assertion is zero-width allows us to
use part of the characters in the assertion for capture.


=head1 Grouping and capturing

In regular (non-regex) Perl 6, you can use parentheses to group things
together, usually to override operator precedence:

    say 1 + 4 * 2;     # OUTPUT: «9␤», parsed as 1 + (4 * 2)
    say (1 + 4) * 2;   # OUTPUT: «10␤»

The same grouping facility is available in regexes:

    / a || b c /;      # matches 'a' or 'bc'
    / ( a || b ) c /;  # matches 'ac' or 'bc'

The same grouping applies to quantifiers:

    / a b+ /;          # matches an 'a' followed by one or more 'b's
    / (a b)+ /;        # matches one or more sequences of 'ab'
    / (a || b)+ /;     # matches a string of 'a's and 'b's, except empty string

An unquantified capture produces a L<Match> object. When a capture is
quantified (except with the C<?> quantifier) the capture becomes a list of
L<Match> objects instead.

=head2 X«Capturing|regex,( )»

The round parentheses don't just group, they also I<capture>; that is, they
make the string matched within the group available as a variable, and also as
an element of the resulting L<Match> object:

    my $str =  'number 42';
    if $str ~~ /'number ' (\d+) / {
        say "The number is $0";         # OUTPUT: The number is 42
        # or
        say "The number is $/[0]";      # OUTPUT: The number is 42
    }

Pairs of parentheses are numbered left to right, starting from zero.

    if 'abc' ~~ /(a) b (c)/ {
        say "0: $0; 1: $1";             # OUTPUT: «0: a; 1: c␤»
    }

The C<$0> and C<$1> etc. syntax is shorthand. These captures
are canonically available from the match object C<$/> by using it as a list,
so C<$0> is actually syntactic sugar for C<$/[0]>.

Coercing the match object to a list gives an easy way to programmatically
access all elements:

    if 'abc' ~~ /(a) b (c)/ {
        say $/.list.join: ', '  # OUTPUT: «a, c␤»
    }

=head2 X«Non-capturing grouping|regex,[ ]»

The parentheses in regexes perform a double role: they group the regex
elements inside and they capture what is matched by the sub-regex inside.

To get only the grouping behavior, you can use square brackets C<[ ... ]>
instead.

    if 'abc' ~~ / [a||b] (c) / {
        say ~$0;                # OUTPUT: «c␤»
    }

If you do not need the captures, using non-capturing C<[ ... ]>
groups provides the following benefits:
=item  they more cleanly communicate the regex intent;
=item  they make it easier to count the capturing groups that do mean;
=item  they make matching a bit faster.

=head2 Capture numbers

It is stated above that captures are numbered from left to right. While true
in principle, this is also over simplification.

The following rules are listed for the sake of completeness. When you find
yourself using them regularly, it's worth considering named captures (and
possibly subrules) instead.

Alternations reset the capture count:

    / (x) (y)  || (a) (.) (.) /
    # $0  $1      $0  $1  $2

Example:

    if 'abc' ~~ /(x)(y) || (a)(.)(.)/ {
        say ~$1;        # OUTPUT: «b␤»
    }

If two (or more) alternations have a different number of captures,
the one with the most captures determines the index of the next capture:

    if 'abcd' ~~ / a [ b (.) || (x) (y) ] (.) / {
        #                 $0     $0  $1    $2
        say ~$2;            # OUTPUT: «d␤»
    }

Captures can be nested, in which case they are numbered per level

    if 'abc' ~~ / ( a (.) (.) ) / {
        say "Outer: $0";                # OUTPUT: Outer: abc
        say "Inner: $0[0] and $0[1]";   # OUTPUT: Inner: b and c
    }

If you need to refer to a capture from within another capture, store
it in a variable first:

    # !!WRONG!! The $0 refers to a capture *inside* the second capture
    say "11" ~~ /(\d) ($0)/; # OUTPUT: «Nil␤»

    # CORRECT: $0 is saved into a variable outside the second capture
    # before it is used inside (the `{}` is needed to update the current match)
    say "11" ~~ /(\d) {} :my $c = $0; ($c)/; # OUTPUT: «｢11｣␤ 0 => ｢1｣␤ 1 => ｢1｣␤»
    say "Matched $c"; # OUTPUT: «␤Matched 1␤»


X<|:my>
C<:my> helps scoping the C<$c> variable within the regex and beyond; in this
case we can use it in the next sentence to show what has been matched inside the
regex. This can be used for debugging inside regular expressions, for instance:

    my $paragraph="line\nline2\nline3";
    $paragraph ~~ rx| :my $counter = 0; ( \V* { ++$counter } ) *%% \n |;
    say "Matched $counter lines"; # OUTPUT: «Matched 3 lines␤»

X<|:our>
The C<:our>, similarly to L<C<our>|/syntax/our> in classes, can be used in L<Grammar>s to declare variables that can be accessed, via its fully qualified name, from outside the grammar:

=begin code
grammar HasOur {
    token TOP {
        :our $our = 'Þor';
        $our \s+ is \s+ mighty
    }
}

say HasOur.parse('Þor is mighty'); # OUTPUT: «｢Þor is mighty｣␤»
say $HasOur::our;                  # OUTPUT: «Þor␤»
=end code

Once the parsing has been done successfully, we use the FQN name of the C<$our>
variable to access its value, that can be none other than C<Þor>.

=head2 X<Named captures|regex, Named captures>

Instead of numbering captures, you can also give them names. The generic,
and slightly verbose, way of naming captures is like this:

    if 'abc' ~~ / $<myname> = [ \w+ ] / {
        say ~$<myname>      # OUTPUT: «abc␤»
    }

The access to the named capture, C«$<myname>», is a shorthand for indexing
the match object as a hash, in other words: C<$/{ 'myname' }> or C«$/<myname>».

Named captures can also be nested using regular capture group syntax:

    if 'abc-abc-abc' ~~ / $<string>=( [ $<part>=[abc] ]* % '-' ) / {
        say ~$<string>;          # OUTPUT: «abc-abc-abc␤»
        say ~$<string><part>;    # OUTPUT: «abc abc abc␤»
        say ~$<string><part>[0]; # OUTPUT: «abc␤»
    }

Coercing the match object to a hash gives you easy programmatic access to
all named captures:

    if 'count=23' ~~ / $<variable>=\w+ '=' $<value>=\w+ / {
        my %h = $/.hash;
        say %h.keys.sort.join: ', ';        # OUTPUT: «value, variable␤»
        say %h.values.sort.join: ', ';      # OUTPUT: «23, count␤»
        for %h.kv -> $k, $v {
            say "Found value '$v' with key '$k'";
            # outputs two lines:
            #   Found value 'count' with key 'variable'
            #   Found value '23' with key 'value'
        }
    }

A more convenient way to get named captures is discussed in
the L<Subrules|#Subrules> section.

=head2 X«Capture markers: C«<( )>»|regex,<( )>»

A C«<(» token indicates the start of the match's overall capture, while the
corresponding C«)>» token indicates its endpoint. The C«<(» is similar to other
languages X<\K|regex deprecated,\K> to discard any matches found before the
C<\K>.

    say 'abc' ~~ / a <( b )> c/;            # OUTPUT: «｢b｣␤»
    say 'abc' ~~ / <(a <( b )> c)>/;        # OUTPUT: «｢bc｣␤»

As in the example above, you can see C«<(» sets the start point and C«)>» sets
the endpoint; since they are actually independent of each other, the inner-most
start point wins (the one attached to C<b>) and the outer-most end wins (the one
attached to C<c>).

=head1 Substitution

Regular expressions can also be used to substitute one piece of text for
another. You can use this for anything, from correcting a spelling error
(e.g., replacing 'Perl Jam' with 'Pearl Jam'), to reformatting an ISO8601
date from C<yyyy-mm-ddThh:mm:ssZ> to C<mm-dd-yy h:m {AM,PM}> and beyond.

Just like the search-and-replace editor's dialog box, the C<s/ / /> operator
has two sides, a left and right side. The left side is where your matching
expression goes, and the right side is what you want to replace it with.

=head2 X<Lexical conventions|quote,s/ / />

Substitutions are written similarly to matching, but the substitution operator
has both an area for the regex to match, and the text to substitute:

    =begin code :preamble<my $str;>
    s/replace/with/;           # a substitution that is applied to $_
    $str ~~ s/replace/with/;   # a substitution applied to a scalar
    =end code

The substitution operator allows delimiters other than the slash:

    s|replace|with|;
    s!replace!with!;
    s,replace,with,;

Note that neither the colon C<:> nor balancing delimiters such as C<{}> or
C<()> can be substitution delimiters. Colons clash with adverbs such as
C<s:i/Foo/bar/> and the other delimiters are used for other purposes.

If you use balancing brackets/braces/parentheses, the substitution works like
this instead:

    s[replace] = 'with';

The right-hand side is now a (not quoted) Perl 6 expression, in which C<$/>
is available as the current match:

    $_ = 'some 11 words 21';
    s:g[ \d+ ] =  2 * $/;
    .say;                    # OUTPUT: «some 22 words 42␤»

Like the C<m//> operator, whitespace is ignored in the regex part of a
substitution. Comments, as in Perl 6 in general, start with the hash
character C<#> and go to the end of the current line.

=head2 Replacing string literals

The simplest thing to replace is a string literal. The string you want to
replace goes on the left-hand side of the substitution operator, and the string
you want to replace it with goes on the right-hand side; for example:

    $_ = 'The Replacements';
    s/Replace/Entrap/;
    .say;                    # OUTPUT: «The Entrapments␤»

Alphanumeric characters and the underscore are literal matches, just as in its
cousin the C<m//> operator. All other characters must be escaped with a
backslash C<\> or included in quotes:

    $_ = 'Space: 1999';
    s/Space\:/Party like it's/;
    .say                        # OUTPUT: «Party like it's 1999␤»

Note that the matching restrictions only apply to the left-hand side of the
substitution expression.

By default, substitutions are only done on the first match:

    $_ = 'There can be twly two';
    s/tw/on/;                     # replace 'tw' with 'on' once
    .say;                         # OUTPUT: «There can be only two␤»

=head2 Wildcards and character classes

Anything that can go into the C<m//> operator can go into the left-hand side
of the substitution operator, including wildcards and character classes. This
is handy when the text you're matching isn't static, such as trying to match
a number in the middle of a string:

    $_ = "Blake's 9";
    s/\d+/7/;         # replace any sequence of digits with '7'
    .say;             # OUTPUT: «Blake's 7␤»

Of course, you can use any of the C<+>, C<*> and C<?> modifiers, and they'll
behave just as they would in the C<m//> operator's context.

=head2 Capturing groups

Just as in the match operator, capturing groups are allowed on the left-hand
side, and the matched contents populate the C<$0>..C<$n> variables and the
C<$/> object:

    $_ = '2016-01-23 18:09:00';
    s/ (\d+)\-(\d+)\-(\d+) /today/;   # replace YYYY-MM-DD with 'today'
    .say;                             # OUTPUT: «today 18:09:00␤»
    "$1-$2-$0".say;                   # OUTPUT: «01-23-2016␤»
    "$/[1]-$/[2]-$/[0]".say;          # OUTPUT: «01-23-2016␤»

Any of these variables C<$0>, C<$1>, C<$/> can be used on the right-hand side
of the operator as well, so you can manipulate what you've just matched. This
way you can separate out the C<YYYY>, C<MM> and C<DD> parts of a date and
reformat them into C<MM-DD-YYYY> order:

    $_ = '2016-01-23 18:09:00';
    s/ (\d+)\-(\d+)\-(\d+) /$1-$2-$0/;    # transform YYYY-MM-DD to MM-DD-YYYY
    .say;                                 # OUTPUT: «01-23-2016 18:09:00␤»

Named capture can be used too:

    $_ = '2016-01-23 18:09:00';
    s/ $<y>=(\d+)\-$<m>=(\d+)\-$<d>=(\d+) /$<m>-$<d>-$<y>/;
    .say;                                 # OUTPUT: «01-23-2016 18:09:00␤»

Since the right-hand side is effectively a regular Perl 6 interpolated string,
you can reformat the time from C<HH:MM> to C<h:MM {AM,PM}> like so:

    $_ = '18:38';
    s/(\d+)\:(\d+)/{$0 % 12}\:$1 {$0 < 12 ?? 'AM' !! 'PM'}/;
    .say;                                 # OUTPUT: «6:38 PM␤»

Using the modulo C<%> operator above keeps the sample code under 80 characters,
but is otherwise the same as C« $0 < 12 ?? $0 !! $0 - 12 ». When combined with
the power of the Parser Expression Grammars that B<really> underlies what
you're seeing here, you can use "regular expressions" to parse pretty much any
text out there.

=head2 Common adverbs

The full list of adverbs that you can apply to regular expressions can be found
elsewhere in this document (L<section Adverbs|#Adverbs>), but the most
common are probably C<:g> and C<:i>.

=item Global adverb C<:g>

Ordinarily, matches are only made once in a given string, but adding the
C<:g> modifier overrides that behavior, so that substitutions are made
everywhere possible. Substitutions are non-recursive; for example:

    $_ = q{I can say "banana" but I don't know when to stop};
    s:g/na/nana,/;    # substitute 'nana,' for 'na'
    .say;             # OUTPUT: «I can say "banana,nana," but I don't ...␤»

Here, C<na> was found twice in the original string and each time there was a
substitution. The substitution only
applied to the original string, though. The resulting string was not impacted.

=item Insensitive adverb C<:i>

Ordinarily, matches are case-sensitive. C<s/foo/bar/> will only match
C<'foo'> and not C<'Foo'>. If the adverb C<:i> is used, though, matches become
case-insensitive.

    $_ = 'Fruit';
    s/fruit/vegetable/;
    .say;                          # OUTPUT: «Fruit␤»

    s:i/fruit/vegetable/;
    .say;                          # OUTPUT: «vegetable␤»

For more information on what these adverbs are actually
doing, refer to the L<section Adverbs|#Adverbs> section of this document.

These are just a few of the transformations you can apply with the substitution
operator. Some of the simpler uses in the real world include removing personal
data from log files, editing MySQL timestamps into PostgreSQL format, changing
copyright information in HTML files and sanitizing form fields in a web
application.

As an aside, novices to regular expressions often get overwhelmed and think
that their regular expression needs to match every piece of data in the line,
including what they want to match. Write just enough to match the data you're
looking for, no more, no less.

=head1 X<Tilde for nesting structures|regex, tilde; regex, ~>

The C<~> operator is a helper for matching nested subrules with a
specific terminator as the goal. It is designed to be placed between
an opening and closing bracket, like so:

    / '(' ~ ')' <expression> /

However, it mostly ignores the left argument, and operates on the next
two atoms (which may be quantified). Its operation on those next two
atoms is to "twiddle" them so that they are actually matched in
reverse order. Hence the expression above, at first blush, is merely
shorthand for:

    / '(' <expression> ')' /

But beyond that, when it rewrites the atoms it also inserts the
apparatus that will set up the inner expression to recognize the
terminator, and to produce an appropriate error message if the inner
expression does not terminate on the required closing atom. So it
really does pay attention to the left bracket as well, and it actually
rewrites our example to something more like:

X<|SETGOAL>
    =begin code :skip-test
    $<OPEN> = '(' <SETGOAL: ')'> <expression> [ $GOAL || <FAILGOAL> ]
    =end code

X<FAILGOAL> is a special method that can be defined by the user and it
will be called on parse failure:

    grammar A { token TOP { '[' ~ ']' \w+  };
                method FAILGOAL($goal) {
                    die "Cannot find $goal near position {self.pos}"
                }
    }

    say A.parse: '[good]';  # OUTPUT: «｢[good]｣␤»
    A.parse: '[bad';        # will throw FAILGOAL exception
    CATCH { default { put .^name, ': ', .Str } };
    # OUTPUT: «X::AdHoc: Cannot find ']'  near position 4␤»

Note that you can use this construct to set up expectations for a
closing construct even when there's no opening bracket:

    "3)"  ~~ / <?> ~ ')' \d+ /;  # RESULT: «｢3)｣»
    "(3)" ~~ / <?> ~ ')' \d+ /;  # RESULT: «｢3)｣»

Here <?> returns true on the first null string.

The order of the regex capture is original:

    "abc" ~~ /a ~ (c) (b)/;
    say $0; # OUTPUT: «｢c｣␤»
    say $1; # OUTPUT: «｢b｣␤»

=head1 X<Subrules|declarator,regex>

Just like you can put pieces of code into subroutines, you can also put
pieces of regex into named rules.

    my regex line { \N*\n }
    if "abc\ndef" ~~ /<line> def/ {
        say "First line: ", $<line>.chomp;      # OUTPUT: «First line: abc␤»
    }

A named regex can be declared with C<my regex named-regex { body here }>, and
called with C«<named-regex>». At the same time, calling a named regex
installs a named capture with the same name.

To give the capture a different name from the regex, use the syntax
C«<capture-name=named-regex>». If no capture is desired, a leading dot
or ampersand will suppress it: C«<.named-regex>» if it is a method
declared in the same class or grammar, C«<&named-regex>» for a regex
declared in the same lexical context.

Here's more complete code for parsing C<ini> files:

    my regex header { \s* '[' (\w+) ']' \h* \n+ }
    my regex identifier  { \w+ }
    my regex kvpair { \s* <key=identifier> '=' <value=identifier> \n+ }
    my regex section {
        <header>
        <kvpair>*
    }

    my $contents = q:to/EOI/;
        [passwords]
            jack=password1
            joy=muchmoresecure123
        [quotas]
            jack=123
            joy=42
    EOI

    my %config;
    if $contents ~~ /<section>*/ {
        for $<section>.list -> $section {
            my %section;
            for $section<kvpair>.list -> $p {
                %section{ $p<key> } = ~$p<value>;
            }
            %config{ $section<header>[0] } = %section;
        }
    }
    say %config.perl;

    # OUTPUT: «{:passwords(${:jack("password1"), :joy("muchmoresecure123")}),
    #           :quotas(${:jack("123"), :joy("42")})}»

Named regexes can and should be grouped in L<grammars|/language/grammars>. A
list of predefined subrules is listed in
L<S05-regex|https://design.perl6.org/S05.html#Predefined_Subrules> of design
documents.

=head1 X<Regex interpolation|regex, Regex Interpolation>

If you want to build a regex using a pattern given at runtime, regex
interpolation is what you are looking for.

There are four ways you can interpolate a string into regex as a pattern.
That is using C«$pattern», C«$($pattern)», C«<$pattern>» or
C«<{$pattern.method}>».

If the variable to be interpolated is statically typed as a C«Str» or C«str»
(like C«$pattern0» and C«$pattern2» are below) and only interpolated literally
(like the first example below), than the compiler can optimize that and it runs
much faster.

    my Str $text = 'camelia';
    my Str $pattern0 = 'camelia';
    my     $pattern1 = 'ailemac';
    my str $pattern2 = '\w+';

    say $text ~~ / $pattern0 /;                # OUTPUT: «｢camelia｣␤»
    say $text ~~ / $($pattern0) /;             # OUTPUT: «｢camelia｣␤»
    say $text ~~ / $($pattern1.flip) /;        # OUTPUT: «｢camelia｣␤»
    say 'ailemacxflip' ~~ / $pattern1.flip /;  # OUTPUT: «｢ailemacxflip｣␤»
    say '\w+' ~~ / $pattern2 /;                # OUTPUT: «｢\w+｣␤»
    say '\w+' ~~ / $($pattern2) /;             # OUTPUT: «｢\w+｣␤»

    say $text ~~ / <{$pattern1.flip}> /;       # OUTPUT: «｢camelia｣␤»
    # say $text ~~ / <$pattern1.flip> /;       # !!Compile Error!!
    say $text ~~ / <$pattern2> /;              # OUTPUT: «｢camelia｣␤»
    say $text ~~ / <{$pattern2}> /;            # OUTPUT: «｢camelia｣␤»

Note that the first two syntax interpolate the string lexically, while
C«<$pattern>» and C«<{$pattern.method}>» causes L«implicit EVAL|
/language/traps#<{$x}>_vs_$($x):_Implicit_EVAL», which is a known trap.

When an array variable is interpolated into a regex, the regex engine handles it
like a C<|> alternative of the regex elements (see the documentation on
L<embedded lists|#Quoted lists are LTM matches>, above).  The interpolation
rules for individual elements are the same as for scalars, so strings and
numbers match literally, and L</type/Regex|Regex> objects match as regexes. Just
as with ordinary C<|> interpolation, the longest match succeeds:

    my @a = '2', 23, rx/a.+/;
    say ('b235' ~~ /  b @a /).Str;      # OUTPUT: «b23»

The use of hash variables in regexes is preserved.

=head2 Regex boolean condition check

The special operator C«<?{}>» allows the evaluation of a boolean expression that
can perform a semantic evaluation of the match before the regular expression
continues. In other words, it is possible to check in a boolean context a part
of a regular expression and therefore invalidate the whole match (or allow it to
continue) even if the match succeeds from a syntactic point of view.

In particular the C«<?{}>» operator requires a C<True> value in order to allow the regular expression to match, while its
negated form  C«<!{}>» requires a C<False> value.

In order to demonstrate the above operator, please consider the following
example that involves a simple IPv4 address matching:

=begin code
my $localhost = '127.0.0.1';
my regex ipv4-octet { \d ** 1..3 <?{ True }> }
$localhost ~~ / ^ <ipv4-octet> ** 4 % "." $ /;
say $/<ipv4-octet>;   # OUTPUT: [｢127｣ ｢0｣ ｢0｣ ｢1｣]
=end code

The C<octet> regular expression matches against a number made by one up to three
digits. Each match is driven by the result of the C«<?{}>», that being the fixed
value of C<True> means that the regular expression match has to be always
considered as good. As a counter-example, using the special constant value
C<False> will invalidate the match even if the regular expression matches from a
syntactic point of view:

=begin code
my $localhost = '127.0.0.1';
my regex ipv4-octet { \d ** 1..3 <?{ False }> }
$localhost ~~ / ^ <ipv4-octet> ** 4 % "." $ /;
say $/<ipv4-octet>;   # OUTPUT: Nil
=end code

From the above examples, it should be clear that it is possible to improve the
semantic check, for instance ensuring that each I<octet> is really a valid IPv4
octet:

=begin code
my $localhost = '127.0.0.1';
my regex ipv4-octet { \d ** 1..3 <?{ $/.Int <= 255 && $/.Int >= 0 }> }
$localhost ~~ / ^ <ipv4-octet> ** 4 % "." $ /;
say $/<ipv4-octet>;   # OUTPUT: [｢127｣ ｢0｣ ｢0｣ ｢1｣]
=end code

Please note that it is not required to evaluate the regular expression in-line,
but also a regular method can be called to get the boolean value:

=begin code
my $localhost = '127.0.0.1';
sub check-octet ( Int $o ){ $o <= 255 && $o >= 0 }
my regex ipv4-octet { \d ** 1..3 <?{ &check-octet( $/.Int ) }> }
$localhost ~~ / ^ <ipv4-octet> ** 4 % "." $ /;
say $/<ipv4-octet>;   # OUTPUT: [｢127｣ ｢0｣ ｢0｣ ｢1｣]
=end code

Of course, being C«<!{}>» the negation form of C«<?{}>» the same boolean
evaluation can be rewritten in a negated form:

=begin code
my $localhost = '127.0.0.1';
sub invalid-octet( Int $o ){ $o < 0 || $o > 255 }
my regex ipv4-octet { \d ** 1..3 <!{ &invalid-octet( $/.Int ) }> }
$localhost ~~ / ^ <ipv4-octet> ** 4 % "." $ /;
say $/<ipv4-octet>;   # OUTPUT: [｢127｣ ｢0｣ ｢0｣ ｢1｣]
=end code

=head1 Adverbs

Adverbs modify how regexes work and provide convenient shortcuts for
certain kinds of recurring tasks.

There are two kinds of adverbs: regex adverbs apply at the point where a
regex is defined and matching adverbs apply at the point that a regex
matches against a string.

This distinction often blurs, because matching and declaration are often
textually close but using the method form of matching makes the distinction
clear.

C<'abc' ~~ /../> is roughly equivalent to C<'abc'.match(/../)>, or even more
clearly written in separate lines:

    my $regex = /../;           # definition
    if 'abc'.match($regex) {    # matching
        say "'abc' has at least two characters";
    }

Regex adverbs like C<:i> go into the definition line and matching adverbs
like C<:overlap> are appended to the match call:

    my $regex = /:i . a/;
    for 'baA'.match($regex, :overlap) -> $m {
        say ~$m;
    }
    # OUTPUT: «ba␤aA␤»

=head2 X<Regex adverbs|regex adverb,:ignorecase;regex adverb,:i>

Adverbs that appear at the time of a regex declaration are part of the
actual regex and influence how the Perl 6 compiler translates the regex into
binary code.

For example, the C<:ignorecase> (C<:i>) adverb tells the compiler to ignore
the distinction between upper case, lower case and title case letters.

So C<'a' ~~ /A/> is false, but C<'a' ~~ /:i A/> is a successful match.

Regex adverbs can come before or inside a regex declaration and only affect
the part of the regex that comes afterwards, lexically.
Note that regex adverbs appearing before the regex must appear after
something that introduces the regex to the parser, like 'rx' or 'm' or a bare '/'.
This is NOT valid:

    =begin code :skip-test
    my $rx1 = :i/a/;      # adverb is before the regex is recognized => exception
    =end code

but these are valid:

    my $rx1 = rx:i/a/;     # before
    my $rx2 = m:i/a/;      # before
    my $rx3 = /:i a/;      # inside

These two regexes are equivalent:

    my $rx1 = rx:i/a/;      # before
    my $rx2 = rx/:i a/;     # inside

Whereas these two are not:

    my $rx3 = rx/a :i b/;   # matches only the b case insensitively
    my $rx4 = rx/:i a b/;   # matches completely case insensitively

Square brackets and parentheses limit the scope of an adverb:

    / (:i a b) c /;         # matches 'ABc' but not 'ABC'
    / [:i a b] c /;         # matches 'ABc' but not 'ABC'

=head3 X<Ignoremark|regex adverb,:ignoremark;regex adverb,:m>

The C<:ignoremark> or C<:m> adverb instructs the regex engine to only
compare base characters, and ignore additional marks such as combining
accents:

    =begin code
    say so 'a' ~~ rx/ä/;                # OUTPUT: «False»
    say so 'a' ~~ rx:ignoremark /ä/;    # OUTPUT: «True»
    say so 'ỡ' ~~ rx:ignoremark /o/;    # OUTPUT: «True>
    =end code

=head3 X<Ratchet|regex adverb,:ratchet;regex adverb,:r>

The C<:ratchet> or C<:r> adverb causes the regex engine to not backtrack
(see L<backtracking|/language/regexes#Backtracking>). Mnemonic: a
L<ratchet|https://en.wikipedia.org/wiki/Ratchet_%28device%29> only moves
in one direction and can't backtrack.

Without this adverb, parts of a regex will try different ways to match a
string in order to make it possible for other parts of the regex to match.
For example, in C<'abc' ~~ /\w+ ./>, the C<\w+> first eats up the whole
string, C<abc> but then the C<.> fails. Thus C<\w+> gives up a character,
matching only C<ab>, and the C<.> can successfully match the string C<c>.
This process of giving up characters (or in the case of alternations, trying
a different branch) is known as backtracking.

    say so 'abc' ~~ / \w+ . /;        # OUTPUT: «True␤»
    say so 'abc' ~~ / :r \w+ . /;     # OUTPUT: «False␤»

Ratcheting can be an optimization, because backtracking is costly. But more
importantly, it closely corresponds to how humans parse a text. If you have
a regex C<my regex identifier { \w+ }> and
C<my regex keyword { if | else | endif }>, you intuitively expect the
C<identifier> to gobble up a whole word and not have it give up its end to
the next rule, if the next rule otherwise fails.

For example, you don't
expect the word C<motif> to be parsed as the identifier C<mot> followed by
the keyword C<if>. Instead, you expect C<motif> to be parsed as one identifier;
and if the parser expects an C<if> afterwards, best that it should fail than
have it parse the input in a way you don't expect.

Since ratcheting behavior is often desirable in parsers, there's a
shortcut to declaring a ratcheting regex:

    =begin code :skip-test
    my token thing { .... }
    # short for
    my regex thing { :r ... }
    =end code

=head3 X<Sigspace|regex adverb,:sigspace;regex adverb,:s>

The B<C<:sigspace>> or B<C<:s>> adverb makes whitespace significant in a
regex.

    =begin code :allow<B>
    say so "I used Photoshop®"   ~~ m:i/   photo shop /;      # OUTPUT: «True␤»
    say so "I used a photo shop" ~~ m:iB<:s>/ photo shop /;   # OUTPUT: «True␤»
    say so "I used Photoshop®"   ~~ m:iB<:s>/ photo shop /;   # OUTPUT: «False␤»
    =end code

C<m:s/ photo shop /> acts the same as
C<m/ photo <.ws> shop <.ws> />. By default, C<< <.ws> >> makes sure that
words are separated, so C<a    b> and C<^&> will match C<< <.ws> >> in the
middle, but C<ab> won't.

Where whitespace in a regex turns into C«<.ws>» depends on what comes before
the whitespace. In the above example, whitespace in the beginning of a regex
doesn't turn into C«<.ws>», but whitespace after characters does. In
general, the rule is that if a term might match something, whitespace after
it will turn into C«<.ws>».

In addition, if whitespace comes after a term but I<before> a quantifier
(C<+>, C<*>, or C<?>), C«<.ws>» will be matched after every match of the
term. So, C<foo +> becomes C«[ foo <.ws> ]+». On the other hand, whitespace
I<after> a quantifier acts as normal significant whitespace; e.g., "C<foo+ >"
becomes C«foo+ <.ws>».

In all, this code:

    =begin code :allow<B>
    rx :s {
        ^^
        {
            say "No sigspace after this";
        }
        <.assertion_and_then_ws>
        characters_with_ws_after+
        ws_separated_characters *
        [
        | some "stuff" .. .
        | $$
        ]
        :my $foo = "no ws after this";
        $foo
    }
    =end code

Becomes:

    =begin code :allow<B>
    rx {
        ^^ B«<.ws>»
        {
            say "No space after this";
        }
        <.assertion_and_then_ws> B«<.ws>»
        characters_with_ws_after+ B«<.ws>»
        [ws_separated_characters B«<.ws>»]* B«<.ws>»
        [
        | some B«<.ws>» "stuff" B«<.ws>» .. B«<.ws>» . B«<.ws>»
        | $$ B«<.ws>»
        ] B«<.ws>»
        :my $foo = "no ws after this";
        $foo B«<.ws>»
    }
    =end code

If a regex is declared with the C<rule> keyword, both the C<:sigspace> and
C<:ratchet> adverbs are implied.

Grammars provide an easy way to override what C«<.ws>» matches:

    grammar Demo {
        token ws {
            <!ww>       # only match when not within a word
            \h*         # only match horizontal whitespace
        }
        rule TOP {      # called by Demo.parse;
            a b '.'
        }
    }

    # doesn't parse, whitespace required between a and b
    say so Demo.parse("ab.");                 # OUTPUT: «False␤»
    say so Demo.parse("a b.");                # OUTPUT: «True␤»
    say so Demo.parse("a\tb .");              # OUTPUT: «True␤»

    # \n is vertical whitespace, so no match
    say so Demo.parse("a\tb\n.");             # OUTPUT: «False␤»

When parsing file formats where some whitespace (for example, vertical
whitespace) is significant, it's advisable to override C<ws>.

=head3 Perl 5 compatibility adverb

The B<C<:Perl5>> or B<C<:P5>> adverb switch the Regex parsing and matching
to the way Perl 5 regexes behave:

    so 'hello world' ~~ m:Perl5/^hello (world)/;   # OUTPUT: «True␤»
    so 'hello world' ~~ m/^hello (world)/;         # OUTPUT: «False␤»
    so 'hello world' ~~ m/^ 'hello ' ('world')/;   # OUTPUT: «True␤»

The regular behavior is recommended and more idiomatic in Perl 6 of course,
but the B<C<:Perl5>> adverb can be useful when compatibility with Perl5 is required.

=head2 Matching adverbs

In contrast to regex adverbs, which are tied to the declaration of a regex,
matching adverbs only make sense when matching a string against a regex.

They can never appear inside a regex, only on the outside – either as part
of an C<m/.../> match or as arguments to a match method.

=head3 X<Continue|matching adverb,:continue;matching adverb,:c>

The C<:continue> or short C<:c> adverb takes an argument. The argument is
the position where the regex should start to search. By default, it searches
from the start of the string, but C<:c> overrides that. If no position is
specified for C<:c>, it will default to C<0> unless C<$/> is set, in which
case, it defaults to C<$/.to>.

    given 'a1xa2' {
        say ~m/a./;         # OUTPUT: «a1␤»
        say ~m:c(2)/a./;    # OUTPUT: «a2␤»
    }

I<Note:> unlike C<:pos>, a match with :continue() will attempt to
match further in the string, instead of failing:

    say "abcdefg" ~~ m:c(3)/e.+/; # OUTPUT: «｢efg｣␤»
    say "abcdefg" ~~ m:p(3)/e.+/; # OUTPUT: «False␤»

=head3 X<Exhaustive|matching adverb,:exhaustive;matching adverb,:ex>

To find all possible matches of a regex – including overlapping ones – and
several ones that start at the same position, use the C<:exhaustive> (short
C<:ex>) adverb.

    given 'abracadabra' {
        for m:exhaustive/ a .* a / -> $match {
            say ' ' x $match.from, ~$match;
        }
    }

The above code produces this output:

=begin code :skip-test
abracadabra
abracada
abraca
abra
   acadabra
   acada
   aca
     adabra
     ada
       abra
=end code

=head3 X<Global|regex adverb,:global;regex adverb,:g>

Instead of searching for just one match and returning a
L<Match object|/type/Match>, search for every non-overlapping match and
return them in a L<List|/type/List>. In order to do this, use the C<:global>
adverb:

    given 'several words here' {
        my @matches = m:global/\w+/;
        say @matches.elems;         # OUTPUT: «3␤»
        say ~@matches[2];           # OUTPUT: «here␤»
    }

C<:g> is shorthand for C<:global>.

=head3 X<Pos|regex adverb,:pos;regex adverb,:p>

Anchor the match at a specific position in the string:

    given 'abcdef' {
        my $match = m:pos(2)/.*/;
        say $match.from;        # OUTPUT: «2␤»
        say ~$match;            # OUTPUT: «cdef␤»
    }

C<:p> is shorthand for C<:pos>.

I<Note:> unlike C<:continue>, a match anchored with :pos() will fail,
instead of attempting to match further down the string:

    say "abcdefg" ~~ m:c(3)/e.+/; # OUTPUT: «｢efg｣␤»
    say "abcdefg" ~~ m:p(3)/e.+/; # OUTPUT: «False␤»

=head3 X<Overlap|regex adverb,:overlap;regex adverb,:ov>

To get several matches, including overlapping matches, but only one (the
longest) from each starting position, specify the C<:overlap> (short C<:ov>)
adverb:

    given 'abracadabra' {
        for m:overlap/ a .* a / -> $match {
            say ' ' x $match.from, ~$match;
        }
    }

produces

=begin code :skip-test
abracadabra
   acadabra
     adabra
       abra
=end code

=head2 Substitution adverbs

You can apply matching adverbs (such as C<:global>, C<:pos> etc.) to
substitutions. In addition, there are adverbs that only make sense for
substitutions, because they transfer a property from the matched string
to the replacement string.

=head3 X<Samecase|substitution adverb,:samecase;substitution adverb,:ii>

The C<:samecase> or C<:ii> substitution adverb implies the
C<:ignorecase> adverb for the regex part of the substitution, and in
addition carries the case information to the replacement string:

=begin code
$_ = 'The cat chases the dog';
s:global:samecase[the] = 'a';
say $_;                 # OUTPUT: «A cat chases a dog»
=end code

Here you can see that the first replacement string C<a> got capitalized,
because the first string of the matched string was also a capital
letter.

=head3 X<Samemark|substitution adverb,:samemark;substitution adverb,:mm>

The C<:samemark> or C<:mm> adverb implies C<:ignoremark> for the regex,
and in addition, copies the markings from the matched characters to the
replacement string:

=begin code
given 'äộñ' {
    say S:mm/ a .+ /uia/;           # OUTPUT: «üị̂ã»
}
=end code

=head3 X<Samespace|substitution adverb,:samespace;substitution adverb,:ss>

The C<:samespace> or C<:ss> substitution modifier implies the
C<:sigspace> modifier for the regex, and in addition, copies the
whitespace from the matched string to the replacement string:

=begin code
say S:samespace/a ./c d/.perl given "a b";      # OUTPUT: «"c d"»
say S:samespace/a ./c d/.perl given "a\tb";     # OUTPUT: «"c\td"»
say S:samespace/a ./c d/.perl given "a\nb";     # OUTPUT: «"c\nd"»
=end code

The C<ss/.../.../> syntactic form is a shorthand for
C<s:samespace/.../.../>.

=head1 Best practices and gotchas

To help with robust regexes and grammars, here are some best practices
for code layout and readability, what to actually match, and avoiding common
pitfalls.

=head2 Code layout

Without the C<:sigspace> adverb, whitespace is not significant in Perl 6
regexes. Use that to your own advantage and insert whitespace where it
increases readability. Also, insert comments where necessary.

Compare the very compact

    my regex float { <[+-]>?\d*'.'\d+[e<[+-]>?\d+]? }

to the more readable

    my regex float {
         <[+-]>?        # optional sign
         \d*            # leading digits, optional
         '.'
         \d+
         [              # optional exponent
            e <[+-]>?  \d+
         ]?
    }

As a rule of thumb, use whitespace around atoms and inside groups; put
quantifiers directly after the atom; and vertically align opening and closing
square brackets and parentheses.

When you use a list of alternations inside parentheses or square brackets, align
the vertical bars:

    my regex example {
        <preamble>
        [
        || <choice_1>
        || <choice_2>
        || <choice_3>
        ]+
        <postamble>
    }

=head2 Keep it small

Regexes are often more compact than regular code. Because they do so much with
so little, keep regexes short.

When you can name a part of a regex, it's usually best to
put it into a separate, named regex.

For example, you could take the float regex from earlier:

    my regex float {
         <[+-]>?        # optional sign
         \d*            # leading digits, optional
         '.'
         \d+
         [              # optional exponent
            e <[+-]>?  \d+
         ]?
    }

And decompose it into parts:

    my token sign { <[+-]> }
    my token decimal { \d+ }
    my token exponent { 'e' <sign>? <decimal> }
    my regex float {
        <sign>?
        <decimal>?
        '.'
        <decimal>
        <exponent>?
    }

That helps, especially when the regex becomes more complicated. For example,
you might want to make the decimal point optional in the presence of an exponent.

    my regex float {
        <sign>?
        [
        || <decimal>?  '.' <decimal> <exponent>?
        || <decimal> <exponent>
        ]
    }

=head2 What to match

Often the input data format has no clear-cut specification, or the
specification is not known to the programmer. Then, it's good to be liberal
in what you expect, but only so long as there are no possible ambiguities.

For example, in C<ini> files:

    =begin code :skip-test
    [section]
    key=value
    =end code

What can be inside the section header? Allowing only a word might be too
restrictive. Somebody might write C<[two words]>, or use dashes, etc.
Instead of asking what's allowed on the inside, it might be worth asking
instead: I<what's not allowed?>

Clearly, closing square brackets are not allowed, because C<[a]b]> would be
ambiguous. By the same argument, opening square brackets should be forbidden.
This leaves us with

    token header { '[' <-[ \[\] ]>+ ']' }

which is fine if you are only processing one line. But if you're processing
a whole file, suddenly the regex parses

    =begin code :lang<text>
    [with a
    newline in between]
    =end code

which might not be a good idea.  A compromise would be

    token header { '[' <-[ \[\] \n ]>+ ']' }

and then, in the post-processing, strip leading and trailing spaces and tabs
from the section header.

=head2 Matching whitespace

The C<:sigspace> adverb (or using the C<rule> declarator instead of C<token>
or C<regex>) is very handy for implicitly parsing whitespace that can appear
in many places.

Going back to the example of parsing C<ini> files, we have

    my regex kvpair { \s* <key=identifier> '=' <value=identifier> \n+ }

which is probably not as liberal as we want it to be, since the user might
put spaces around the equals sign. So, then we may try this:

    my regex kvpair { \s* <key=identifier> \s* '=' \s* <value=identifier> \n+ }

But that's looking unwieldy, so we try something else:

    my rule kvpair { <key=identifier> '=' <value=identifier> \n+ }

But wait! The implicit whitespace matching after the value uses up all
whitespace, including newline characters, so the C<\n+> doesn't have
anything left to match (and C<rule> also disables backtracking, so no luck
there).

Therefore, it's important to redefine your definition of implicit whitespace
to whitespace that is not significant in the input format.

This works by redefining the token C<ws>; however, it only works for
L<grammars|/language/grammars>:

    grammar IniFormat {
        token ws { <!ww> \h* }
        rule header { \s* '[' (\w+) ']' \n+ }
        token identifier  { \w+ }
        rule kvpair { \s* <key=identifier> '=' <value=identifier> \n+ }
        token section {
            <header>
            <kvpair>*
        }

        token TOP {
            <section>*
        }
    }

    my $contents = q:to/EOI/;
        [passwords]
            jack = password1
            joy = muchmoresecure123
        [quotas]
            jack = 123
            joy = 42
    EOI
    say so IniFormat.parse($contents);

Besides putting all regexes into a grammar and turning them into tokens
(because they don't need to backtrack anyway), the interesting new bit is

        token ws { <!ww> \h* }

which gets called for implicit whitespace parsing. It matches when it's not
between two word characters (C<< <!ww> >>, negated "within word" assertion),
and zero or more horizontal space characters. The limitation to horizontal
whitespace is important, because newlines (which are vertical whitespace)
delimit records and shouldn't be matched implicitly.

Still, there's some whitespace-related trouble lurking. The regex C<\n+>
won't match a string like C<"\n \n">, because there's a blank between the
two newlines. To allow such input strings, replace C<\n+> with C<\n\s*>.


=head1 Backtracking

Perl 6 defaults to I<backtracking> when evaluating regular expressions.
Backtracking is a technique that allows the engine to try different
matching in order to allow every part of a regular expression to succeed.
This is costly, because it requires the engine to usually
eat up as much as possible in the first match and then
adjust going backwards in order to ensure all regular expression parts
have a chance to match.

In order to better understand backtracking, consider the following example:

=begin code
my $string = 'PostgreSQL is an SQL database!';
say $string ~~ /(.+)(SQL) (.+) $1/; # OUTPUT: ｢PostgreSQL is an SQL｣
=end code

What happens in the above example is that the string has to be matched against
the second occurrence of the word I<SQL>, eating all characters before and
leaving out the rest.

Since it is possible to execute a piece of code within a regular expression, it
is also possible to inspect the L<Match|/type/Match> object within the regular
expression itself:

=begin code :preamble<my $string = '';>
my $iteration = 0;
sub show-captures( Match $m ){
    my Str $result_split;
    say "\n=== Iteration {++$iteration} ===";
    for $m.list.kv -> $i, $capture {
        say "Capture $i = $capture";
        $result_split ~= '[' ~ $capture ~ ']';
    }

    say $result_split;
}

$string ~~ /(.+)(SQL) (.+) $1 (.+) { show-captures( $/ );  }/;
=end code

The C<show-captures> method will dump all the elements of C<$/> producing
the following output:

=for code :lang<text>
=== Iteration 1 ===
Capture 0 = Postgre
Capture 1 = SQL
Capture 2 =  is an
[Postgre][SQL][ is an ]

showing that the string has been split around the second occurrence of I<SQL>, that
is the repetition of the first capture (C<$/[1]>).

With that in place, it is now possible to see how the engine backtracks to find
the above match: it does suffice to move the C<show-captures> in the middle of
the regular expression, in particular before the repetition of the first capture
C<$1> to see it in action:

=begin code :preamble<my $string = '';>
my $iteration = 0;
sub show-captures( Match $m ){
    my Str $result-split;
    say "\n=== Iteration {++$iteration} ===";
    for $m.list.kv -> $i, $capture {
        say "Capture $i = $capture";
        $result-split ~= '[' ~ $capture ~ ']';
    }

    say $result-split;
}

$string ~~ / (.+)(SQL) (.+) { show-captures( $/ );  } $1 /;
=end code

The output will be much more verbose and will show several iterations, with the
last one being the I<winning>. The following is an excerpt of the output:

=begin code :lang<text>
=== Iteration 1 ===
Capture 0 = PostgreSQL is an
Capture 1 = SQL
Capture 2 =  database!
[PostgreSQL is an ][SQL][ database!]

=== Iteration 2 ===
Capture 0 = PostgreSQL is an
Capture 1 = SQL
Capture 2 =  database
[PostgreSQL is an ][SQL][ database]

...

=== Iteration 24 ===
Capture 0 = Postgre
Capture 1 = SQL
Capture 2 =  is an
[Postgre][SQL][ is an ]
=end code

In the first iteration the I<SQL> part of I<PostgreSQL> is kept within the word: that is not what
the regular expression asks for, so there's the need for another iteration. The second iteration will move back,
in particular one character back (removing thus the final I<!>) and try to match again, resulting
in a fail since again the I<SQL> is still kept within I<PostgreSQL>.
After several iterations, the final result is match.

It is worth noting that the final iteration is number I<24>, and that such number is exactly
the distance, in number of chars, from the end of the string to the first I<SQL> occurrence:

=begin code :preamble<my $string = '';>
say $string.chars - $string.index: 'SQL'; # OUTPUT: 23
=end code

Since there are 23 chars from the very end of the string to the very first I<S> of I<SQL>
the backtracking engine will need 23 "useless" matches to find the right one, that
is will need 24 steps to get the final result.

Backtracking is a costly machinery, therefore it is possible to disable
it in those cases where the matching can be found I<forward> only.

With regards to the above example, disabling backtracking means
the regular expression will not have any chance to match:

=begin code :preamble<my $string = '';>
say $string ~~ /(.+)(SQL) (.+) $1/;      # OUTPUT: ｢PostgreSQL is an SQL｣
say $string ~~ / :r (.+)(SQL) (.+) $1/;  # OUTPUT: Nil
=end code

The fact is that, as shown in the I<iteration 1> output, the first match of the
regular expression engine will be C<PostgreSQL is an >, C<SQL>, C< database>
that does not leave out any room for matching another occurrence of the word
I<SQL> (as C<$1> in the regular expression). Since the engine is not able to get
backward and change the path to match, the regular expression fails.

It is worth noting that disabling backtracking will not prevent the engine
to try several ways to match the regular expression.
Consider the following slightly changed example:

=begin code
my $string = 'PostgreSQL is an SQL database!';
say $string ~~ / (SQL) (.+) $1 /; # OUTPUT: Nil
=end code

Since there is no specification for a character before the word I<SQL>,
the engine will match against the rightmost word I<SQL> an go forward
from there. Since there is no repetition of I<SQL> remaining, the
match fails.
It is possible, again, to inspect what the engine performs
introducing a dumping piece of code within the regular expression:

=begin code :preamble<my $string = '';>
my $iteration = 0;
sub show-captures( Match $m ){
    my Str $result-split;
    say "\n=== Iteration {++$iteration} ===";
    for $m.list.kv -> $i, $capture {
        say "Capture $i = $capture";
        $result-split ~= '[' ~ $capture ~ ']';
    }

    say $result-split;
}

$string ~~ / (SQL) (.+) { show-captures( $/ ); } $1 /;
=end code

that produces a rather simple output:

=begin code :lang<text>
=== Iteration 1 ===
Capture 0 = SQL
Capture 1 =  is an SQL database!
[SQL][ is an SQL database!]

=== Iteration 2 ===
Capture 0 = SQL
Capture 1 =  database!
[SQL][ database!]
=end code

Even using the L<:r|/language/regexes#ratchet> adverb to prevent backtracking
will not change things:

=begin code :preamble<my $string = '';>
my $iteration = 0;
sub show-captures( Match $m ){
    my Str $result-split;
    say "\n=== Iteration {++$iteration} ===";
    for $m.list.kv -> $i, $capture {
        say "Capture $i = $capture";
        $result-split ~= '[' ~ $capture ~ ']';
    }

    say $result-split;
}

$string ~~ / :r (SQL) (.+) { show-captures( $/ ); } $1 /;
=end code

and the output will remain the same:

=begin code :lang<text>
=== Iteration 1 ===
Capture 0 = SQL
Capture 1 =  is an SQL database!
[SQL][ is an SQL database!]

=== Iteration 2 ===
Capture 0 = SQL
Capture 1 =  database!
[SQL][ database!]
=end code

This demonstrate that disabling backtracking does not mean disabling possible
multiple iterations of the matching engine, but rather disabling the backward
matching tuning.


=head1 C<$/> changes each time a regular expression is matched

It is worth noting that each time a regular expression is used, the
L<Match object|/type/Match> returned (i.e., C<$/>) is reset. In other
words, C<$/> always refers to the very last regular expression
matched:

=begin code
my $answer = 'a lot of Stuff';
say 'Hit a capital letter!' if $answer ~~ / <[A..Z>]> /;
say $/;  # OUTPUT: ｢S｣
say 'hit an x!' if $answer ~~ / x /;
say $/;  # OUTPUT: Nil
=end code

The reset of C<$/> applies independently from the scope where the regular expression
is matched:

=begin code
my $answer = 'a lot of Stuff';
if $answer ~~ / <[A..Z>]> / {
   say 'Hit a capital letter';
   say $/;  # OUTPUT: ｢S｣
}
say $/;  # OUTPUT: ｢S｣

if True {
  say 'hit an x!' if $answer ~~ / x /;
  say $/;  # OUTPUT: Nil
}

say $/;  # OUTPUT: Nil
=end code

The very same concept applies to named captures:

=begin code
my $answer = 'a lot of Stuff';
if $answer ~~ / $<capital>=<[A..Z>]> / {
   say 'Hit a capital letter';
   say $/<capital>; # OUTPUT: ｢S｣
}

say $/<capital>;    # OUTPUT: ｢S｣
say 'hit an x!' if $answer ~~ / $<x>=x /;
say $/<x>;          # OUTPUT: Nil
say $/<capital>;    # OUTPUT: Nil
=end code


=end pod
# vim: expandtab softtabstop=4 shiftwidth=4 ft=perl6