
=begin pod

=TITLE Regexes

=SUBTITLE Pattern matching against strings

Regular expressions are a computer science concept where simple patterns
describe the format of text. Pattern matching is the process of applying
these patterns to actual text to look for matches.

Most modern regular expression facilities are more powerful than traditional
regular expressions due to the influence of languages such as Perl, but the
short-hand term I<regex> has stuck and continues to mean I<regular
expression-like pattern matching>.

Perl 6 patterns can match much more than regular languages, but we
continue to call them I<regexes>.

=head1 X<Lexical conventions|quote,/ /;quote,rx;quote,m>

Perl 6 has special syntax for writing regexes:

    m/abc/;         # a regex that is immediately matched against $_
    rx/abc/;        # a Regex object
    /abc/;          # a Regex object

The first two can use delimiters other than the slash:

    m{abc};
    rx{abc};

Note that neither the colon C<:> nor round parentheses can be delimiters;
the colon is forbidden because it clashes with adverbs, such as C<rx:i/abc/>
(case insensitive regexes), and round parentheses indicate a function call
instead.

Whitespace in regexes is generally ignored (except with the C<:s> or
C<:sigspace> adverb).

As in the rest of Perl 6, comments in regexes start with a hash character
C<#> and go to the end of the current line.
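
As a small sketch of both rules (the sample string and variable names are
chosen only for illustration), the following three regexes match the same
text, because unquoted whitespace and comments are not part of the pattern:

=begin code
    my $compact   = 'a.c' ~~ /a\.c/;
    my $spaced    = 'a.c' ~~ / a \. c /;
    my $commented = 'a.c' ~~ /
        a       # a literal letter
        \.      # an escaped, literal dot
        c       # another literal letter
    /;
    say ~$compact;      # a.c
    say ~$spaced;       # a.c
    say ~$commented;    # a.c
=end code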

=head1 Literals

The simplest case of a regex is a constant string. Matching a string against
that regex searches for that string:

    if 'properly' ~~ m/ perl / {
        say "'properly' contains 'perl'";
    }

Alphanumeric characters and the underscore C<_> are literal matches.  All
other characters must either be escaped with a backslash (for example C<\:>
to match a colon), or included in quotes:

    / 'two words' /     # matches 'two words' including the blank
    / "a:b"       /     # matches 'a:b' including the colon
    / '#' /             # matches a hash character

Strings are searched left to right for the regex, thus it is sufficient if a
substring matches the regex:

    if 'abcdef' ~~ / de / {
        say ~$/;            # de
        say $/.prematch;    # abc
        say $/.postmatch;   # f
        say $/.from;        # 3
        say $/.to;          # 5
    };

Match results are stored in the C<$/> variable and are also returned from
the match. The result is a L<Match|/type/Match> object if the match was
successful; otherwise it is L<Nil|/type/Nil>.
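
A minimal sketch of handling both outcomes follows. The sample strings are
made up for illustration, and the code relies only on a failed match being
false in boolean context:

=begin code
    for 'abcdef', 'uvwxyz' -> $text {
        if $text ~~ / de / {
            # $/ holds the Match object of the successful match
            say $text, ' matched ', ~$/, ' at ', $/.from, '..', $/.to;
        }
        else {
            say $text, ' did not match';
        }
    }
    # abcdef matched de at 3..5
    # uvwxyz did not match
=end code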

=head1 Wildcards and character classes

=head2 X<Dot to match any character|regex,.>

An unescaped dot C<.> in a regex matches any single character.

So these all match:

    'perl' ~~ /per./;       # matches the whole string
    'perl' ~~ / per . /;    # the same; whitespace is ignored
    'perl' ~~ / pe.l /;     # the . matches the r
    'speller' ~~ / pe.l/;   # the . matches the first l

This doesn't match:

    'perl' ~~ /. per /

because there is no character to match before C<per> in the target string.

=head2 Backslashed, predefined character classes

There are predefined character classes of the form C<\w>. The negation of
each class is written with the corresponding upper-case letter, for example
C<\W>.

=item X<\d and \D|regex,\d;regex,\D>

C<\d> matches a single digit (Unicode property C<Nd>, decimal number) and
C<\D> matches a single character that is not a digit.

    'ab42' ~~ /\d/ and say ~$/;     # 4
    'ab42' ~~ /\D/ and say ~$/;     # a

Note that not only the Arabic digits (commonly used in the Latin alphabet)
match C<\d>, but also digits from other scripts.

Examples of digits:

    U+0035 5 DIGIT FIVE
    U+07C2 ߂ NKO DIGIT TWO
    U+0E53 ๓ THAI DIGIT THREE
    U+1B56 ᭖ BALINESE DIGIT SIX

=item X<\h and \H|regex,\h;regex,\H>

C<\h> matches a single horizontal whitespace character. C<\H> matches a
single character that is not a horizontal whitespace character.

Examples of horizontal whitespace characters:

    U+0020 SPACE
    U+00A0 NO-BREAK SPACE
    U+0009 CHARACTER TABULATION
    U+2001 EM QUAD

Vertical whitespace like newline characters are explicitly excluded; those
can be matched with C<\v>, and C<\s> matches any kind of whitespace.

=item X<\n and \N|regex,\n;regex,\N>

C<\n> matches a single, logical newline character. C<\n> is also supposed
to match a Windows CR LF codepoint pair, though it is unclear whether that
translation happens when external data is read or at regex match time.
C<\N> matches a single character that's not a logical newline.

=item X<\s and \S|regex,\s;regex,\S>

C<\s> matches a single whitespace character. C<\S> matches a single
character that is not whitespace.

    if 'contains a word starting with "w"' ~~ / w \S+ / {
        say ~$/;        # word
    }

=item X<\t and \T|regex,\t;regex,\T>

C<\t> matches a single tab/tabulation character, C<U+0009>. (Note that
exotic tabs like the C<U+000B VERTICAL TABULATION> character are not
included here). C<\T> matches a single character that is not a tab.

=item X<\v and \V|regex,\v;regex,\V>

C<\v> matches a single vertical whitespace character. C<\V> matches a single
character that is not vertical whitespace.

Examples for vertical whitespace characters:

    U+000A LINE FEED
    U+000B VERTICAL TABULATION
    U+000D CARRIAGE RETURN
    U+0085 NEXT LINE
    U+2029 PARAGRAPH SEPARATOR

Use C<\s> to match any kind of whitespace, not just vertical whitespace.

=item X<\w and \W|regex,\w;regex,\W>

C<\w> matches a single word character, i.e. a letter (Unicode category L), a
digit or an underscore. C<\W> matches a single character that isn't a word
character.

Examples of word characters:

    U+0041 A LATIN CAPITAL LETTER A
    U+0031 1 DIGIT ONE
    U+03B4 δ GREEK SMALL LETTER DELTA
    U+03F3 ϳ GREEK LETTER YOT
    U+0409 Љ CYRILLIC CAPITAL LETTER LJE
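
The backslashed classes above can be tried out on a short string. This is
only a sketch; the sample line is invented for the purpose:

=begin code
    my $line = "width\t= 42\n";
    say ~($line ~~ / \w+ /);    # width
    say ~($line ~~ / \d+ /);    # 42
    say so $line   ~~ / \h /;   # True  (the tab and the blank)
    say so $line   ~~ / \v /;   # True  (the trailing newline)
    say so 'width' ~~ / \s /;   # False (no whitespace at all)
    say so 'width' ~~ / \D /;   # True  (every character is a non-digit)
=end code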

=head2 X«Unicode properties|regex,<:property>»

The character classes so far are mostly for convenience; a more systematic
approach is the use of Unicode properties. They are called in the form C<<
<:property> >>, where C<property> can be a short or long Unicode property
name.

The following list is stolen from the Perl 5
L<perlunicode|http://perldoc.perl.org/perlunicode.html> documentation:

=begin table

    Short       Long
    =====       =====
    L           Letter
    LC          Cased_Letter
    Lu          Uppercase_Letter
    Ll          Lowercase_Letter
    Lt          Titlecase_Letter
    Lm          Modifier_Letter
    Lo          Other_Letter
    M           Mark
    Mn          Nonspacing_Mark
    Mc          Spacing_Mark
    Me          Enclosing_Mark
    N           Number
    Nd          Decimal_Number (also Digit)
    Nl          Letter_Number
    No          Other_Number
    P           Punctuation (also Punct)
    Pc          Connector_Punctuation
    Pd          Dash_Punctuation
    Ps          Open_Punctuation
    Pe          Close_Punctuation
    Pi          Initial_Punctuation
                (may behave like Ps or Pe depending on usage)
    Pf          Final_Punctuation
                (may behave like Ps or Pe depending on usage)
    Po          Other_Punctuation
    S           Symbol
    Sm          Math_Symbol
    Sc          Currency_Symbol
    Sk          Modifier_Symbol
    So          Other_Symbol
    Z           Separator
    Zs          Space_Separator
    Zl          Line_Separator
    Zp          Paragraph_Separator
    C           Other
    Cc          Control (also Cntrl)
    Cf          Format
    Cs          Surrogate
    Co          Private_Use
    Cn          Unassigned

=end table

For example C<< <:Lu> >> matches a single, upper-case letter.

Negation works as C<< <:!category> >>, so C<< <:!Lu> >> matches a single
character that isn't an upper-case letter.

Several categories can be combined with one of these infix operators:

=begin table

    Operator    Meaning
    ========    =======
     +          set union
     |          set union
     &          set intersection
     -          set difference (first minus second)
     ^          symmetric set difference / XOR

=end table

So, to match either a lower-case letter or a number, one can write
C<< <:Ll+:N> >> or C<< <:Ll+:Number> >> or C<< <+ :Lowercase_Letter + :Number> >>.

It is also possible to group categories and sets of categories with
parentheses, e.g.:

    'perl6' ~~ m{\w+(<:Ll+:N>)}  # 0 => ｢6｣
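
A short sketch of property matching, negation and union, using an invented
sample string:

=begin code
    say ~('Perl 6' ~~ / <:Lu> /);       # P   (first upper-case letter)
    say ~('Perl 6' ~~ / <:!Lu> /);      # e   (first character that is not one)
    say ~('Perl 6' ~~ / <:Ll+:N>+ /);   # erl (lower-case letters or numbers)
    say ~('Perl 6' ~~ / <:N> /);        # 6
=end code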

=head2 X«Enumerated character classes and ranges|regex,<[ ]>;regex,<-[ ]>»

Sometimes the pre-existing wildcards and character classes are not enough.
Fortunately, defining your own is fairly simple. Between C<< <[ ]> >>, you
can put any number of single characters and ranges of characters (expressed
with two dots between the end points), with or without whitespace.

    "abacabadabacaba" ~~ / <[ a .. c 1 2 3 ]> /

Between the C<< < > >> you can also use the same operators as for categories
(C<+>, C<|>, C<&>, C<->, C<^>) to combine multiple range definitions and
even mix in some of the Unicode categories above. You are also allowed to
write the backslashed forms for character classes between the C< [ ] >.

    / <[\d] - [13579]> /
    # not quite the same as
    / <[02468]> /
    # because the first one also contains digits from other scripts

To negate a character class, put a C<-> after the opening angle:

    say 'no quotes' ~~ /  <-[ " ]> + /;  # matches characters except "

A common pattern for parsing quote-delimited strings involves negated
character classes:

    say '"in quotes"' ~~ / '"' <-[ " ]> * '"'/;

This first matches a quote, then any characters that aren't quotes, and then
a quote again. The meanings of C<*> and C<+> in the examples above are
explained in the section on L<quantifiers|#Quantifiers>.

Just as you can use the C<-> for both set difference and negation of a
single value, you can also explicitly put a C<+> in front:

    / <+[123]> /  # same as <[123]>
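
The following sketch exercises ranges, set difference, negation and the
explicit C<+> form. The sample strings are invented for illustration:

=begin code
    say ~('abacabadabacaba'  ~~ / <[ a .. c ]>+ /);         # abacaba
    say ~('rhythm and blues' ~~ / <[a..z] - [aeiou]>+ /);   # rhythm
    say ~('say "hi" now'     ~~ / '"' <-[ " ]>* '"' /);     # "hi"
    say ~('x42'              ~~ / <+[0..9]>+ /);            # 42
=end code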

=head1 Quantifiers

A quantifier makes a preceding atom match not exactly once, but rather a
variable number of times. For example C<a+> matches one or more C<a>
characters.

Quantifiers bind tighter than concatenation, so C<ab+> matches one C<a>
followed by one or more C<b>s. This is different for quotes, so C<'ab'+>
matches the strings C<ab>, C<abab>, C<ababab> etc.

=head2 X<One or more: +|regex,+>

The C<+> quantifier makes the preceding atom match one or more times, with
no upper limit.

For example to match strings of the form C<key=value>, you can write a regex
like this:

    / \w+ '=' \w+ /

=head2 X<Zero or more: *|regex,*>

The C<*> quantifier makes the preceding atom match zero or more times, with
no upper limit.

For example to allow optional whitespace between C<a> and C<b> you can write

    / a \s* b /

=head2 X<Zero or one match: ?|regex,?>

The C<?> quantifier makes the preceding atom match zero times or once.
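
For instance, an optional letter can be sketched like this (the words are
just sample data):

=begin code
    say so 'color'   ~~ / ^ colou?r $ /;    # True  (no u)
    say so 'colour'  ~~ / ^ colou?r $ /;    # True  (one u)
    say so 'colouur' ~~ / ^ colou?r $ /;    # False (two u's are too many)
=end code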

=head2 X<General quantifier: ** min..max|regex quantifier,**>

To quantify an atom an arbitrary number of times, you can write e.g.
C<a ** 2..5> to match the character C<a> at least twice and at most 5 times.

=begin code
    say Bool('a' ~~ /a ** 2..5/);    #-> False
    say Bool('aaa' ~~ /a ** 2..5/);  #-> True
=end code

If the minimal and maximal number of matches are the same, a single integer
is possible: C<a ** 5> matches C<a> exactly five times.

=begin code
    say Bool('aaaaa' ~~ /a ** 5/);   #-> True
=end code

=head2 X<Modified quantifier: %|regex,%;regex,%%>

To more easily match things like comma separated values, you can tack on a
C<%> modifier to any of the above quantifiers to specify a separator that must
occur between each of the matches.  So, for example C<a+ % ','> will match
C<a> or C<a,a> or C<a,a,a> and so on, but it will not match C<a,> or C<a,a,>.
To match those as well, you may use C<%%> instead of C<%>.
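
A sketch of the difference between C<%> and C<%%>, with anchors added so
that the whole string has to match. The last line uses a bracketed group so
that each separated item may have several digits:

=begin code
    say so 'a,a,a' ~~ / ^ a+ % ',' $ /;         # True
    say so 'a,a,'  ~~ / ^ a+ % ',' $ /;         # False (trailing separator)
    say so 'a,a,'  ~~ / ^ a+ %% ',' $ /;        # True
    say ~('x=1,22,333;' ~~ / [\d+]+ % ',' /);   # 1,22,333
=end code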

=head1 X<Alternation|regex,||>

To match one of several possible alternatives, separate them by C<||>; the
first matching alternative wins.

For example, C<ini> files have the following form:

    [section]
    key = value

Hence, if you parse a single line of an C<ini> file, it can be either a
section or a key-value pair and the regex would be (to a first
approximation):

    / '[' \w+ ']' || \S+ \s* '=' \s* \S* /

That is, either a word surrounded by square brackets, or a string of
non-whitespace characters, followed by optional whitespace, followed by the
equals sign C<=>, followed again by optional whitespace, followed by another
(possibly empty) string of non-whitespace characters.
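
As a sketch, the regex can be applied to a few sample lines. The variable
name C<$ini-line>, the added C<^> and C<$> anchors and the bracketed group
are assumptions made for this illustration:

=begin code
    my $ini-line = / ^ [ '[' \w+ ']' || \S+ \s* '=' \s* \S* ] $ /;
    for '[section]', 'key = value', 'no equals sign' -> $text {
        say $text, ' -> ', so $text ~~ $ini-line;
    }
    # [section] -> True
    # key = value -> True
    # no equals sign -> False

    # the first alternative that matches wins, even if a later one is longer
    say ~('ab' ~~ / a || ab /);     # a
=end code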

=head1 Anchors

The regex engine tries to find a match inside a string by searching from
left to right.

    say so 'properly' ~~ / perl/;   # True
    #          ^^^^

But sometimes this is not what you want.  For instance, you might want to
match the whole string, or a whole line, or one or several whole words.
I<Anchors> or I<assertions> can help you with this by limiting where they
match.

Anchors need to match successfully in order for the whole regex to match but
they do not use up characters while matching.

=head2 X«C<^>, Start of String|regex,^»

The C<^> assertion only matches at the start of the string.

    say so 'properly' ~~ /perl/;        # True
    say so 'properly' ~~ /^ perl/;      # False
    say so 'perly'    ~~ /^ perl/;      # True
    say so 'perl'     ~~ /^ perl/;      # True

=head2 X«C<^^>, Start of Line and C<$$>, End of Line|regex,^^;regex,$$»

The C<^^> assertion matches at the start of a logical line. That is, either
at the start of the string, or after a newline character.

C<$$> matches only at the end of a logical line, that is, before a newline
character, or at the end of the string when the last character is not a
newline character.

(To understand the following example, it is important to know that the
C<q:to/EOS/...EOS> "heredoc" syntax removes leading indentation to the same
level as the C<EOS> marker, so that the first, second and last lines have no
leading space and the third and fourth lines have two leading spaces each.)

=begin code
    my $str = q:to/EOS/;
        There was a young man of Japan
        Whose limericks never would scan.
          When asked why this was,
          He replied "It's because
        I always try to fit as many syllables into the last line as ever I possibly can."
        EOS
    say so $str ~~ /^^ There/;          # True  (start of string)
    say so $str ~~ /^^ limericks/;      # False (not at the start of a line)
    say so $str ~~ /^^ I/;              # True  (start of the last line)
    say so $str ~~ /^^ When/;           # False (there are blanks between
                                        #        start of line and the "When")

    say so $str ~~ / Japan $$/;         # True  (end of first line)
    say so $str ~~ / scan $$/;          # False (there is a . between "scan"
                                        #        and the end of line)
    say so $str ~~ / '."' $$/;          # True  (at the last line)
=end code

=head2 X<<<<C<<< << >>> and C<<< >> >>>, left and right word boundary|regex,<<;regex,>>;regex,«;regex,»>>>>

C<<< << >>> matches a left word boundary: it matches positions where there
is a non-word character at the left (or the start of the string) and a word
character to the right.

C<<< >> >>> matches a right word boundary: it matches positions where there
is a word character at the left and a non-word character at the right (or
the end of the string).

    my $str = 'The quick brown fox';
    say so $str ~~ /br/;                # True
    say so $str ~~ /<< br/;             # True
    say so $str ~~ /br >>/;             # False
    say so $str ~~ /own/;               # True
    say so $str ~~ /<< own/;            # False
    say so $str ~~ /own >>/;            # True

=head1 X«Grouping and Capturing|regex,( );regex,[ ];regex,$<capture> =»

In regular (non-regex) Perl 6, you can use parentheses to group things
together, usually to override operator precedence:

    say 1 + 4 * 2;      # 9, because it is parsed as 1 + (4 * 2)
    say (1 + 4) * 2;    # 10

The same grouping facility is available in regexes:

    / a || b c /        # matches 'a' or 'bc'
    / ( a || b ) c /    # matches 'ac' or 'bc'

The same grouping applies to quantifiers:

    / a b+ /            # Matches an 'a' followed by one or more 'b's
    / (a b)+ /          # Matches one or more sequences of 'ab'
    / (a || b)+ /       # Matches a sequence of 'a's and 'b's, at least one long

An unquantified capture produces a L<Match> object.  When a capture is
quantified (except with the C<?> quantifier) the capture becomes a list of
L<Match> objects instead.
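
A brief sketch of the difference between an unquantified and a quantified
capture:

=begin code
    if 'abc' ~~ / (\w) / {
        say ~$0;                # a  (a single Match object)
    }
    if 'abc' ~~ / (\w)+ / {
        say $0.elems;           # 3  (a list, one Match per repetition)
        say $0.join('|');       # a|b|c
        say ~$0[1];             # b
    }
=end code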

=head2 Capturing

The round parentheses don't just group, they also I<capture>; that is, they
make the string matched within the group available as a variable, and also as
an element of the resulting L<Match|/type/Match> object:

    my $str =  'number 42';
    if $str ~~ /'number ' (\d+) / {
        say "The number is $0";         # The number is 42
        # or
        say "The number is $/[0]";      # The number is 42
    }

Pairs of parentheses are numbered left to right, starting from zero.

    if 'abc' ~~ /(a) b (c)/ {
        say "0: $0; 1: $1";             # 0: a; 1: c
    }

The C<$0> and C<$1> etc. syntax is actually just a shorthand; these captures
are canonically available from the match object C<$/> by using it as a list,
so C<$0> is actually syntax sugar for C<$/[0]>.

Coercing the match object to a list gives an easy way to programmatically
access all elements:

    if 'abc' ~~ /(a) b (c)/ {
        say $/.list.join: ', '  # a, c
    }

=head2 Non-capturing grouping

The parentheses in regexes perform a double role: they group the regex
elements inside and they capture what is matched by the sub-regex inside.

To get only the grouping behavior, you can use square brackets C<[ ... ]>
instead.

    if 'abc' ~~ / [a||b] (c) / {
        say ~$0;                # c
    }

If you do not need the captures, using non-capturing groups provides three
benefits: it communicates the intent more clearly, it makes it easier to
count the capturing groups that you do care about and it is a bit faster.

=head2 Capture numbers

It is stated above that captures are numbered from left to right. While true
in principle, this is also overly simplistic.

The following rules are listed for the sake of completeness; when you find
yourself using them regularly, it is worth considering named captures (and
possibly subrules) instead.

Alternations reset the capture count:

    / (x) (y)  || (a) (.) (.) /
    # $0  $1      $0  $1  $2

Example:

    if 'abc' ~~ /(x)(y) || (a)(.)(.)/ {
        say ~$1;            # b
    }

If two (or more) alternatives have a different number of captures,
the one with the most captures determines the index of the next capture:

=begin code
$_ = 'abcd';

if / a [ b (.) || (x) (y) ] (.) / {
    #      $0     $0  $1    $2
    say ~$2;            # d
}
=end code


Captures can be nested, in which case they are numbered per level:

    if 'abc' ~~ / ( a (.) (.) ) / {
        say "Outer: $0";              # Outer: abc
        say "Inner: $0[0] and $0[1]"; # Inner: b and c
    }

=head2 Named captures

Instead of numbering captures, you can also give them names. The generic --
and slightly verbose -- way of naming captures is like this:

    if 'abc' ~~ / $<myname> = [ \w+ ] / {
        say ~$<myname>      # abc
    }

The access to the named capture, C<< $<myname> >>, is a shorthand for indexing
the match object as a hash, in other words: C<$/{ 'myname' }> or C<< $/<myname> >>.

Coercing the match object to a hash gives you easy programmatic access to
all named captures:

    if 'count=23' ~~ / $<variable>=\w+ '=' $<value>=\w+ / {
        my %h = $/.hash;
        say %h.keys.sort.join: ', ';        # value, variable
        say %h.values.sort.join: ', ';      # 23, count
        for %h.kv -> $k, $v {
            say "Found value '$v' with key '$k'";
            # outputs two lines:
            #   Found value 'count' with key 'variable'
            #   Found value '23' with key 'value'
        }
    }

There is a more convenient way to get named captures which is discussed in
the next section.

=head1 X<Subrules|declarator,regex>

Just like you can put pieces of code into subroutines, you can also put
pieces of regex into named rules.

    my regex line { \N*\n }
    if "abc\ndef" ~~ /<line> def/ {
        say "First line: ", $<line>.chomp;      # First line: abc
    }

A named regex can be declared with C<my regex thename { body here }>, and
called with C<< <thename> >>. At the same time, calling a named regex
installs a named capture with the same name.

If the capture should be of a different name, this can be achieved with the
syntax C<< <capturename=regexname> >>. If no capture at all is desired, a
leading dot will suppress it: C<< <.regexname> >>.
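
A small sketch of capture renaming; the regex name C<word> and the capture
name C<greeting> are hypothetical names chosen for this example:

=begin code
    my regex word { \w+ }
    if 'hello world' ~~ / <greeting=word> \s+ <word> / {
        say ~$<greeting>;   # hello  (captured under the alias)
        say ~$<word>;       # world  (captured under the regex's own name)
    }
=end code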

Here is a more complete (yet still fairly limited) code for parsing C<ini>
files:

    my regex header { \s* '[' (\w+) ']' \h* \n+ }
    my regex identifier  { \w+ }
    my regex kvpair { \s* <key=identifier> '=' <value=identifier> \n+ }
    my regex section {
        <header>
        <kvpair>*
    }

    my $contents = q:to/EOI/;
        [passwords]
            jack=password1
            joy=muchmoresecure123
        [quotas]
            jack=123
            joy=42
    EOI

    my %config;
    if $contents ~~ /<section>*/ {
        for $<section>.list -> $section {
            my %section;
            for $section<kvpair>.list -> $p {
                say $p<value>;
                %section{ $p<key> } = ~$p<value>;
            }
            %config{ $section<header>[0] } = %section;
        }
    }
    say %config.perl;
    # ("passwords" => {"jack" => "password1", "joy" => "muchmoresecure123"},
    #    "quotas" => {"jack" => "123", "joy" => "42"}).hash

Named regexes can and should be grouped in L<grammars|/language/grammars>. A
list of predefined subrules is listed in
L<S05|http://design.perl6.org/S05.html#Predefined_Subrules>.

=head1 Adverbs

Adverbs modify how regexes work and give very convenient shortcuts for
certain kinds of recurring tasks.

There are two kinds of adverbs: regex adverbs apply at the point where a
regex is defined and matching adverbs apply at the point that a regex
matches against a string.

This distinction often blurs, because matching and declaration are often
textually close, but using the method form of matching makes the distinction
clear.

C<'abc' ~~ /../> is roughly equivalent to C<'abc'.match(/../)>, or even more
clearly written in separate lines:

    my $regex = /../;           # definition
    if 'abc'.match($regex) {    # matching
        say "'abc' has at least two characters";
    }

Regex adverbs like C<:i> go into the definition line and matching adverbs
like C<:overlap> are appended to the match call:

    my $regex = /:i . a/;
    for 'baA'.match($regex, :overlap) -> $m {
        say ~$m;
    }
    # output:
    #     ba
    #     aA

=head2 X<Regex Adverbs|regex adverb,:ignorecase;regex adverb,:i>

Adverbs that appear at the time of a regex declaration are part of the
actual regex and influence how the Perl 6 compiler translates the regex into
binary code.

For example, the C<:ignorecase> (C<:i>) adverb tells the compiler to ignore
the distinction between upper case, lower case and title case letters.

So C<'a' ~~ /A/> is false, but C<'a' ~~ /:i A/> is a successful match.

Regex adverbs can come before or inside a regex declaration and only affect
the part of the regex that comes afterwards, lexically.

These two regexes are equivalent:

    my $rx1 = rx:i/a/;      # before
    my $rx2 = rx/:i a/;     # inside

Whereas these two are not:

    my $rx3 = rx/a :i b/;   # matches only the b case insensitively
    my $rx4 = rx/:i a b/;   # matches completely case insensitively

Brackets and parentheses limit the scope of an adverb:

    / (:i a b) c /          # matches 'ABc' but not 'ABC'
    / [:i a b] c /          # matches 'ABc' but not 'ABC'
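
The scoping rules can be checked with a sketch like the following, where
C<^> and C<$> are added so that the whole string has to match:

=begin code
    say so 'ABc' ~~ / ^ [:i a b] c $ /;     # True
    say so 'ABC' ~~ / ^ [:i a b] c $ /;     # False (c is outside the group)
    say so 'aB'  ~~ / ^ a :i b $ /;         # True
    say so 'AB'  ~~ / ^ a :i b $ /;         # False (a comes before the adverb)
=end code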

=head3 X<Ratchet|regex adverb,:ratchet;regex adverb,:r>

The C<:ratchet> or C<:r> adverb causes the regex engine not to backtrack.

Without this adverb, parts of a regex will try different ways to match a
string in order to make it possible for other parts of the regex to match.
For example in C<'abc' ~~ /\w+ ./>, the C<\w+> first eats up the whole
string, C<abc>, but then the C<.> fails. Thus C<\w+> gives up a character,
matching only C<ab>, and the C<.> can successfully match the string C<c>.
This process of giving up characters (or in the case of alternations, trying
a different branch) is known as backtracking.

    say so 'abc' ~~ / \w+ . /;      # True
    say so 'abc' ~~ / :r \w+ . /;   # False

Ratcheting can be an optimization, because backtracking is costly. But more
importantly, it closely corresponds to how humans parse a text. If you have
a regex C<my regex identifier { \w+ }> and
C<my regex keyword { if | else | endif }>, you intuitively expect the
C<identifier> to gobble up a whole word and not have it give up its end to
the next rule, if the next rule otherwise fails. For instance, you don't
expect the word C<motif> to be parsed as the identifier C<mot> followed by
the keyword C<if>; rather you expect C<motif> to be parsed as one identifier
and if the parser expects an C<if> afterwards, rather have it fail than
parse the input in a way you don't expect.

Since ratcheting behavior is so often desirable in parsers, there is a
shortcut to declaring a ratcheting regex:

    my token thing { ... }
    # short for
    my regex thing { :r ... }
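
To illustrate the difference, here is a sketch with two otherwise identical
declarations. The names C<backtracking> and C<ratcheting> are hypothetical:

=begin code
    my regex backtracking { \w+ if }    # \w+ may give characters back
    my token ratcheting   { \w+ if }    # \w+ keeps everything it matched
    say so 'motif' ~~ / ^ <backtracking> $ /;   # True  (\w+ shrinks to 'mot')
    say so 'motif' ~~ / ^ <ratcheting> $ /;     # False (nothing left for 'if')
=end code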

=head3 X<Sigspace|regex adverb,:sigspace;regex adverb,:s>

The B<C<:sigspace>> or B<C<:s>> adverb makes whitespace significant in a
regex.

    =begin code :allow<B>
    say so "I used Photoshop®"   ~~ m:i/   photo shop /; # True
    say so "I used a photo shop" ~~ m:iB<:s>/ photo shop /; # True
    say so "I used Photoshop®"   ~~ m:iB<:s>/ photo shop /; # False
    =end code

C<m:s/ photo shop /> acts just the same as if one had written
C<m/ photo <.ws> shop <.ws> />. By default, C<< <.ws> >> makes sure that
words are separated, so C<a    b> and C<^&> will match C<< <.ws> >> in the
middle, but C<ab> won't.

Where whitespace in a regex turns into C«<.ws>» depends on what comes before
the whitespace. In the above example, whitespace in the beginning of a regex
doesn't turn into C«<.ws>», but whitespace after characters does. In
general, the rule is that if a term might match something, whitespace after
it will turn into C«<.ws>».

In addition, if whitespace comes after a term, but I<before> a quantifier
(C<+>, C<*>, or C<?>), C«<.ws>» will be matched after every match of the
term, so C<foo +> becomes C«[ foo <.ws> ]+». On the other hand, whitespace
I<after> a quantifier acts as normal significant whitespace, e.g., "C<foo+
>" becomes C«foo+ <.ws>».

In all, this code:

    =begin code :allow<B>
    rx :s {
        ^^
        {
            say "No sigspace after this";
        }
        <.assertion_and_then_ws>
        characters_with_ws_after+
        ws_separated_characters *
        [
        | some "stuff" .. .
        | $$
        ]
        :my $foo = "no ws after this";
        $foo
    }
    =end code

Becomes:

    =begin code :allow<B>
    rx {
        ^^ B«<.ws>»
        {
            say "No sigspace after this";
        }
        <.assertion_and_then_ws> B«<.ws>»
        characters_with_ws_after+ B«<.ws>»
        [ws_separated_characters B«<.ws>»]* B«<.ws>»
        [
        | some B«<.ws>» "stuff" B«<.ws>» .. B«<.ws>» . B«<.ws>»
        | $$ B«<.ws>»
        ] B«<.ws>»
        :my $foo = "no ws after this";
        $foo B«<.ws>»
    }
    =end code

If a regex is declared with the C<rule> keyword, both the C<:sigspace> and
C<:ratchet> adverbs are implied.
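
As a sketch of what that means in practice, compare a C<token> and a
C<rule> with the same body. The names C<kv-token> and C<kv-rule> are
hypothetical:

=begin code
    my token kv-token { \w+ '=' \d+ }   # whitespace in the body is ignored
    my rule  kv-rule  { \w+ '=' \d+ }   # whitespace in the body becomes <.ws>
    say so 'x = 1' ~~ / ^ <kv-token> $ /;   # False (blanks are not allowed)
    say so 'x = 1' ~~ / ^ <kv-rule> $ /;    # True
    say so 'x=1'   ~~ / ^ <kv-rule> $ /;    # True  (<.ws> may match nothing
                                            #        between 'x' and '=')
=end code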

Grammars provide an easy way to override what C«<.ws>» matches:

    grammar Demo {
        token ws {
            <!ww>   # only match when not within a word
            \h*     # only match horizontal whitespace
        }
        rule TOP {  # called by Demo.parse;
            a b '.'
        }
    }

    # doesn't parse, whitespace required between a and b
    say so Demo.parse("ab.");       # False
    say so Demo.parse("a b.");      # True
    say so Demo.parse("a\tb .");    # True
    # \n is vertical whitespace, so no match
    say so Demo.parse("a\tb\n.");   # False

When parsing file formats where some whitespace (for example vertical
whitespace) is significant, it is advisable to override C<ws>.

=head2 Matching adverbs

In contrast to regex adverbs, which are tied to the declaration of a regex,
matching adverbs only make sense while matching a string against a regex.

They can never appear inside a regex, only on the outside -- either as part
of an C<m/.../> match or as arguments to a match method.

=head3 X<Continue|matching adverb,:continue;matching adverb,:c>

The C<:continue> or short C<:c> adverb takes an argument. The argument is
the position where the regex should start to search. By default, it searches
from the start of the string, but C<:c> overrides that. If no position is
specified for C<:c> it will default to C<0> unless C<$/> is set, in which
case it defaults to C<$/.to>.

    given 'a1xa2' {
        say ~m/a./;         # a1
        say ~m:c(2)/a./;    # a2
    }

=head3 Exhaustive

To find all possible matches of a regex, including overlapping ones and
several that start at the same position, use the C<:exhaustive> (short
C<:ex>) adverb.

    given 'abracadabra' {
        for m:exhaustive/ a .* a / -> $match {
            say ' ' x $match.from, ~$match;
        }
    }

The above code produces this output:

=begin code
abracadabra
abracada
abraca
abra
   acadabra
   acada
   aca
     adabra
     ada
       abra
=end code

=head3 X<Global|matching adverb,:global;matching adverb,:g>

The C<:global> adverb searches for every non-overlapping match and returns
them in a L<List|/type/List>, instead of searching for just one match and
returning a L<Match object|/type/Match>:

    given 'several words here' {
        my @matches = m:global/\w+/;
        say @matches.elems;         # 3
        say ~@matches[2];           # here
    }

C<:g> is shorthand for C<:global>.

=head3 X<Pos|matching adverb,:pos;matching adverb,:p>

Anchor the match at a specific position in the string:

    given 'abcdef' {
        my $match = m:pos(2)/.*/;
        say $match.from;        # 2
        say ~$match;            # cdef
    }

C<:p> is shorthand for C<:pos>.
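
The difference from C<:continue> is that C<:pos> does not search any
further to the right. A short sketch:

=begin code
    given 'abcdef' {
        say so m:pos(2)/ c /;       # True  (there is a c at position 2)
        say so m:pos(2)/ e /;       # False (position 2 holds a c, not an e)
        say ~m:continue(2)/ e /;    # e     (found by searching from 2 onwards)
    }
=end code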

=head3 Overlap

To get several matches, including overlapping matches, but only one (the
longest) from each starting position, specify the C<:overlap> (short C<:ov>)
adverb:

    given 'abracadabra' {
        for m:overlap/ a .* a / -> $match {
            say ' ' x $match.from, ~$match;
        }
    }

produces

=begin code
abracadabra
   acadabra
     adabra
       abra
=end code

=head1 Look-around assertions

=head2 Lookahead assertions

To check that a pattern appears before another pattern, one can use a
lookahead assertion via the C<before> assertion.  This has the form:

    <?before pattern>

Thus, to search for the string C<foo> which is immediately followed by the
string C<bar>, one could use the following regex:

    rx{ foo <?before bar> }

which one could use like so:

    say "foobar" ~~ rx{ foo <?before bar> };   #->  foo

However, if you want to search for a pattern which is B<not> immediately
followed by some pattern, then you need to use a negative lookahead
assertion, which has the form:

    <!before pattern>

Hence all occurrences of C<foo> which I<are not> followed by C<bar> would
be matched by

    rx{ foo <!before bar> }
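
which, as a sketch with invented sample strings, behaves like so:

=begin code
    say ~('foobaz' ~~ rx{ foo <!before bar> });     # foo
    say so 'foobar' ~~ rx{ foo <!before bar> };     # False
=end code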

=head2 Lookbehind assertions

To check that a pattern appears after another pattern, one can use a
lookbehind assertion via the C<after> assertion.  This has the form:

    <?after pattern>

Thus, to search for the string C<bar> which is immediately preceded by the
string C<foo>, one could use the following regex:

    rx{ <?after foo> bar }

which one could use like so:

    say "foobar" ~~ rx{ <?after foo> bar };   #->  bar

However, if you want to search for a pattern which is B<not> immediately
preceded by some pattern, then you need to use a negative lookbehind
assertion, which has the form:

    <!after pattern>

Hence all occurrences of C<bar> which I<do not> have C<foo> immediately
before them would be matched by

    rx{ <!after foo> bar }
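
A short usage sketch, again with made-up sample strings and an illustrative
variable name:

    my $not-after-foo = rx{ <!after foo> bar };
    say so 'bazbar' ~~ $not-after-foo;      # True
    say so 'foobar' ~~ $not-after-foo;      # False
    say so 'bar'    ~~ $not-after-foo;      # True, nothing precedes 'bar'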

=head1 Best practices and gotchas

Regexes and L<grammars|/language/grammars> are a programming paradigm in
their own right, one that you have to learn unless you already know it well.

To help you write robust regexes and grammars, here are some best practices
that the authors have found useful. These range from small-scale code layout
issues to what actually to match, and help to avoid common pitfalls and
writing unreadable code.

=head2 Code layout

Without the C<:sigspace> adverb, whitespace is not significant in Perl 6
regexes. Use that to your own advantage and insert whitespace where it
increases readability. Also insert comments where necessary.

Compare the very compact

    my regex float { <[+-]>?\d*'.'\d+[e<[+-]>?\d+]? }

to the more readable

    my regex float {
         <[+-]>?        # optional sign
         \d*            # leading digits, optional
         '.'
         \d+
         [              # optional exponent
            e <[+-]>?  \d+
         ]?
    }

As a rule of thumb, use whitespace around atoms and inside groups.  Put
quantifiers directly after the atom, without inserting a blank.  Vertically
align opening and closing brackets and parentheses.

When you use a list of alternations inside parentheses or brackets, align
the vertical bars:

    my regex example {
        <preamble>
        [
        || <choice_1>
        || <choice_2>
        || <choice_3>
        ]+
        <postamble>
    }

=head2 Keep it small

Regexes come with very little boilerplate, so they are often much denser
than regular code.  That density makes long regexes hard to read, so it is
important to keep them short.

When you can come up with a name for a part of a regex, it is usually best
to put it into a separate, named regex.

For example, you could take the float regex from earlier:

    my regex float {
         <[+-]>?        # optional sign
         \d*            # leading digits, optional
         '.'
         \d+
         [              # optional exponent
            e <[+-]>?  \d+
         ]?
    }

And decompose it into parts:

    my token sign { <[+-]> }
    my token decimal { \d+ }
    my token exponent { 'e' <sign>? <decimal> }
    my regex float {
        <sign>?
        <decimal>?
        '.'
        <decimal>
        <exponent>?
    }

That helps especially when the regex becomes more complicated. For example,
you might want to make the decimal point optional if an exponent is there:

    my regex float {
        <sign>?
        [
        || <decimal>?  '.' <decimal> <exponent>?
        || <decimal> <exponent>
        ]
    }
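
To see what this version accepts, one can anchor it and try a few strings.
The following sketch repeats the definitions from above so that it runs on
its own; the sample strings and the variable names in the loop are
illustrative only:

    my token sign     { <[+-]> }
    my token decimal  { \d+ }
    my token exponent { 'e' <sign>? <decimal> }
    my regex float {
        <sign>?
        [
        || <decimal>?  '.' <decimal> <exponent>?
        || <decimal> <exponent>
        ]
    }

    for '3.14', '-.5', '1e10', '2.5e-3', '42' -> $candidate {
        # anchor with ^ and $ so that the whole string has to match
        my $is-float = so $candidate ~~ / ^ <float> $ /;
        say $candidate, ' => ', $is-float;
    }

Every sample except C<42> is accepted; C<42> has neither a decimal point nor
an exponent, so neither alternative applies.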

=head2 What to match

Often the input data format has no clear-cut specification, or the
specification is not known to the programmer.  Then it is good to be liberal
in what you expect, but only as long as there are no ambiguities possible.

For example, in C<ini> files:

    [section]
    key=value

What can be inside the section header? Allowing only a word might be too
restrictive: somebody might write C<[two words]>, or use dashes, and so on.
Instead of asking what's allowed on the inside, it might be worth asking
I<what's not allowed?>

Clearly, closing brackets are not allowed, because C<[a]b]> would be rather
ambiguous.  By the same argument, opening brackets should be forbidden.
This leaves us with

    token header { '[' <-[ \[\] ]>+ ']' }

which is fine if you are only processing one line. But if you're processing
a whole file, suddenly the regex parses

    [with a
    newline in between]

which might not be a good idea.  A pragmatic compromise would be

    token header { '[' <-[ \[\] \n ]>+ ']' }

and then, in the post-processing, strip leading and trailing spaces and tabs
from the section header.
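
A sketch of that post-processing step could look as follows. The capturing
parentheses around the character class are an addition made for this example
(the token above does not capture the name), and the variable names are
illustrative only:

    my token header { '[' ( <-[ \[ \] \n ]>+ ) ']' }

    my $match = '[ two words ]' ~~ / ^ <header> $ /;
    my $name  = $match<header>[0].Str.trim;     # strip surrounding blanks
    say $name;                                  # two words

    say so '[a]b]' ~~ / ^ <header> $ /;         # False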

=head2 Matching whitespace

The C<:sigspace> adverb (or using the C<rule> declarator instead of C<token>
or C<regex>) is very handy for implicitly parsing whitespace that can appear
in many places.

Going back to the example of parsing C<ini> files, we have

    my regex kvpair { \s* <key=identifier> '=' <value=identifier> \n+ }

which is probably not as liberal as we want it to be. Since the user might
put spaces around the equals sign, it should rather read

    my regex kvpair { \s* <key=identifier> \s* '=' \s* <value=identifier> \n+ }

That's growing unwieldy pretty quickly. So instead one can write

    my rule kvpair { <key=identifier> '=' <value=identifier> \n+ }

But wait! The implicit whitespace matching after the value uses up all
whitespace, including newline characters, so the C<\n+> doesn't have
anything left to match (and C<rule> also disables backtracking, so no luck
here).
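
The difference can be observed directly. This sketch assumes that an
identifier is simply C<\w+>; the names C<kvpair-regex> and C<kvpair-rule> are
made up here so that both variants can live side by side:

    my token identifier { \w+ }
    my regex kvpair-regex {
        \s* <key=identifier> \s* '=' \s* <value=identifier> \n+
    }
    my rule  kvpair-rule  { <key=identifier> '=' <value=identifier> \n+ }

    my $line = "answer = 42\n";
    say so $line ~~ / ^ <kvpair-regex> $ /;     # True
    say so $line ~~ / ^ <kvpair-rule>  $ /;     # False, newline already used up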

Therefore it is important to restrict implicit whitespace to whitespace that
is not significant in the input format.

This is done by redefining the token C<ws>. However, that only works in
L<grammars|/language/grammars>:

    grammar IniFormat {
        token ws { <!ww> \h* }
        rule header { \s* '[' (\w+) ']' \n+ }
        token identifier  { \w+ }
        rule kvpair { \s* <key=identifier> '=' <value=identifier> \n+ }
        token section {
            <header>
            <kvpair>*
        }

        token TOP {
            <section>*
        }
    }

    my $contents = q:to/EOI/;
        [passwords]
            jack = password1
            joy = muchmoresecure123
        [quotas]
            jack = 123
            joy = 42
    EOI
    say so IniFormat.parse($contents);

Besides putting all regexes into a grammar and turning them into tokens
(because they don't need to backtrack anyway), the interesting new bit is

        token ws { <!ww> \h* }

which gets called for implicit whitespace parsing. It succeeds only when the
current position is not between two word characters (C<< <!ww> >>, a negated
"within word" assertion), and then consumes zero or more horizontal space
characters. The limitation to horizontal whitespace is important, because
newlines (which are vertical whitespace) delimit records and shouldn't be
matched implicitly.

Still there is some whitespace-related trouble lurking.  The regex C<\n+>
won't match a string like C<"\n \n">, because there is a blank between the
two newlines. To allow such input strings, replace C<\n+> by C<\n\s*>.
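
The effect is easy to check in isolation; the variable name below is
illustrative only:

    my $blank-between = "\n \n";
    say so $blank-between ~~ / ^ \n+   $ /;     # False
    say so $blank-between ~~ / ^ \n\s* $ /;     # True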

=end pod

# vim: expandtab shiftwidth=4 ft=perl6