Commit
Merge pull request #2962 from perl6/regexes
Expansion of Regexes:Lexical Conventions section
JJ committed Aug 21, 2019
2 parents b14b5c0 + 22095ed commit d10938e
Showing 1 changed file with 231 additions and 33 deletions.
264 changes: 231 additions & 33 deletions doc/Language/regexes.pod6
@@ -6,52 +6,159 @@
X<|Regular Expressions>
Regular expressions, I<regexes> for short, are written in a domain-specific
language that describes text patterns. Pattern matching is the process of
matching those patterns to actual text.
A I<regular expression> is a sequence of characters that defines a certain text
pattern, typically one that one wishes to find in some large body of text.
In theoretical computer science and formal language theory, regular expressions
are used to describe so-called
L<I<regular languages>|https://en.wikipedia.org/wiki/Regular_language>. Since
their inception in the 1950s, practical implementations of regular expressions,
for instance in the text search and replace functions of text editors, have outgrown
their strict scientific definition. In acknowledgement of this, and in an attempt
to disambiguate, a regular expression in Perl 6 is normally referred to as a
I<regex> (from: I<reg>ular I<ex>pression), a term that is also in common use in
other programming languages.
In Perl 6, regexes are written in a
L<I<domain-specific language>|https://en.wikipedia.org/wiki/Domain-specific_language>,
i.e. a sublanguage or I<slang>. This page describes this language, and explains how
regexes can be used to search for text patterns in strings in a process called
I<pattern matching>.
=head1 X<Lexical conventions|quote,/ /;quote,rx;quote,m>
Perl 6 has special syntax for literal regexes:
Fundamentally, Perl 6 regexes are very much like subroutines: both are code
objects, and just as you can have anonymous subs and named subs, you can have
anonymous and named regexes.
m/abc/; # a regex that is immediately matched against $_
rx/abc/; # a Regex object
/abc/; # a Regex object; shorthand version of 'rx/ /' operator
A regex, whether anonymous or named, is represented by a L<C<Regex>|/type/Regex>
object. Yet, the syntax for constructing anonymous and named C<Regex> objects
differs. We will therefore discuss them in turn.
One difference between the C<m/ /> and C<rx/ /> forms on the one hand, and the
C</ /> form on the other, is that C<m> and C<rx> may be followed by
L<adverbs|/language/regexes#Adverbs>. Another difference is that the
former forms allow delimiters other than the slash to be used:
=head2 Anonymous regex definition syntax
m{abc}; # curly braces as delimiters
rx:i[abc]; # :i adverb, and square brackets as delimiters
An anonymous regex may be constructed in one of the following ways:
As may be inferred from the above example, the use of a colon as an alternative
delimiter would clash with the use of adverbs; accordingly, such use of the
colon is forbidden. Similarly, parentheses cannot be used as alternative regex
delimiters, at least not without a space between C<m> or C<rx> and the
opening delimiter. This is because identifiers that are immediately followed by
parentheses are always parsed as a subroutine call. For example, in C<rx()> the L<call
operator|/language/operators#postcircumfix_(_)> C<()> invokes the subroutine
C<rx>. The form C<rx ( abc )>, however, I<does> define a Regex object.
rx/pattern/; # an anonymous Regex object; 'rx' stands for 'regex'
/pattern/; # an anonymous Regex object; shorthand for 'rx/.../'
Here's an example that illustrates the difference between the C<m/ /> and C</ />
operators:
regex { pattern }; # keyword-declared anonymous regex; this form is
# intended for defining named regexes and is discussed
# in that context in the next section
my $match;
$_ = "abc";
$match = m/.+/; say $match; say $match.^name; # OUTPUT: «「abc」␤Match␤»
$match = /.+/; say $match; say $match.^name; # OUTPUT: «/.+/␤Regex␤»
The C<rx/ /> form has two advantages over the bare shorthand form C</ />.
Firstly, it enables the use of delimiters other than the slash, which may be
used to improve the readability of the regex definition:
rx{ '/tmp/'.* }; # the use of curly braces as delimiters makes this first
rx/ '/tmp/'.* /; # definition somewhat easier on the eyes than the second
Although the choice is vast, not every character may be chosen as an alternative
regex delimiter:
=begin item
You cannot use whitespace or alphanumeric characters as delimiters. Whitespace
in regex definition syntax is generally optional, except where it is required to
distinguish from function call syntax (discussed hereafter).
=end item
=begin item
Parentheses can be used as alternative regex delimiters, but only with a space
between C<rx> and the opening delimiter. This is because identifiers that are
immediately followed by parentheses are always parsed as a subroutine call. For example,
in C<rx()> the L<call operator|/language/operators#postcircumfix_(_)> C<()>
invokes the subroutine C<rx>. The form C<rx ( abc )>, however, I<does> define a
C<Regex> object.
=end item
=begin item
Use of a colon as a delimiter would clash with the use of
L<adverbs|/language/regexes#Adverbs>, which take the form C<:adverb>;
accordingly, such use of the colon is forbidden.
=end item
=begin item
The hashmark C<#> is not available as a delimiter since it is parsed as the start
of a L<comment|/language/syntax#Single-line_comments> that runs until the end of
the line.
=end item
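As a minimal sketch of the delimiter rules above (the sample pattern C<ab+c> and the variable names are made up for illustration), the following builds the same anonymous regex with several delimiters and checks that each one matches:

```raku
# The same pattern written with four different delimiters.
my @regexes =
    rx/ab+c/,     # slash
    rx{ab+c},     # curly braces
    rx[ab+c],     # square brackets
    rx!ab+c!;     # exclamation marks

# A pointy block is used because '~~' sets $_ to its left-hand side,
# so referring to the regex as $_ on the right-hand side would not work.
my @results = @regexes.map(-> $re { so "xabbcx" ~~ $re });
say @results;     # OUTPUT: «[True True True True]␤»
```

Which delimiter to pick is a matter of readability; all four definitions above yield equivalent C<Regex> objects.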
Secondly, the C<rx> form enables the use of
L<regex adverbs|/language/regexes#Regex_adverbs>, which may be placed between C<rx> and the
opening delimiter to modify the definition of the entire regex:
rx:r:s/pattern/; # :r (:ratchet) and :s (:sigspace) adverbs, defining
# a ratcheting regex in which whitespace is significant
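To see what a regex adverb changes in practice, here is a small sketch comparing a plain regex with its C<:s> and C<:i> counterparts. The variable names and sample strings are illustrative assumptions, not part of the original text:

```raku
my $plain  = rx/ 'hello' 'world' /;     # whitespace between atoms is ignored
my $spaced = rx:s/ 'hello' 'world' /;   # :s makes that whitespace significant
my $nocase = rx:i/ 'hello' /;           # :i ignores case

say so "helloworld"  ~~ $plain;    # OUTPUT: «True␤»
say so "hello world" ~~ $plain;    # OUTPUT: «False␤»
say so "helloworld"  ~~ $spaced;   # OUTPUT: «False␤»
say so "hello world" ~~ $spaced;   # OUTPUT: «True␤»
say so "HELLO"       ~~ $nocase;   # OUTPUT: «True␤»
```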
Although anonymous regexes are not, as such, I<named>, they may effectively be
given a name by putting them inside a named variable, after which they can be
referenced, both outside of an embedding regex and from within an embedding
regex by means of L<interpolation|/language/regexes#Regex_interpolation>:
my $regex = / R \w+ /;
say "Zen Buddhists like Raku too" ~~ $regex; # OUTPUT: 「Raku」
my $regex = /pottery/;
"Japanese pottery rocks!" ~~ / <$regex> /; # Interpolation of $regex into /.../
say $/; # OUTPUT: 「pottery」
=head2 Named regex definition syntax
A named regex may be constructed using the C<regex> declarator as follows:
regex R { pattern }; # a named Regex object, named 'R'
Unlike with the C<rx> form, you cannot choose your preferred delimiter: curly
braces are mandatory. In this regard it should be noted that the definition of a
named regex using the C<regex> form is syntactically similar to the definition
of a subroutine:
my sub S { /pattern/ }; # definition of Sub object (returning a Regex)
my regex R { pattern }; # definition of Regex object
which emphasizes the fact that a L<C<Regex>|/type/Regex> object represents code
rather than data:
say &S ~~ Code; # OUTPUT: True
say &R ~~ Code; # OUTPUT: True
say &R ~~ Method; # OUTPUT: True (A Regex is really a Method!)
Also unlike with the C<rx> form for defining an anonymous regex, the definition
of a named regex using the C<regex> keyword does not allow for adverbs to be
inserted before the opening delimiter. Instead, adverbs that are to modify the
entire regex pattern may be included first thing within the curly braces:
regex R { :i pattern }; # :i (:ignorecase), renders pattern case insensitive
Alternatively, by way of shorthand, it is also possible (and recommended) to use
the C<rule> and C<token> variants of the C<regex> declarator for defining a
C<Regex> when the C<:ratchet> and C<:sigspace> adverbs are of interest:
regex R { :r pattern }; # apply :r (:ratchet) to entire pattern
token R { pattern }; # same thing: 'token' implies ':r'
regex R { :r :s pattern }; # apply :r (:ratchet) and :s (:sigspace) to pattern
rule R { pattern }; # same thing: 'rule' implies ':r:s'
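The practical effect of C<:ratchet> is that the regex engine does not backtrack into what it has already matched. A minimal sketch of the difference, using the made-up names C<with-backtracking> and C<no-backtracking>:

```raku
my regex with-backtracking { .+ 'b' }   # .+ may give back characters it took
my token no-backtracking   { .+ 'b' }   # :ratchet: .+ keeps everything it took

# In the regex, .+ first takes "aab", then backtracks so that 'b' can match.
say so "aab" ~~ / ^ <with-backtracking> $ /;   # OUTPUT: «True␤»

# In the token, .+ takes "aab" and never gives the final "b" back.
say so "aab" ~~ / ^ <no-backtracking> $ /;     # OUTPUT: «False␤»
```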
Named regexes may be used as building blocks for other regexes, as they are
methods that may be called from within other regexes using the C«<regex-name>»
syntax. When they are used this way, they are often referred to as I<subrules>;
see L<Subrules|/language/regexes#Subrules> for more details on their use.
L<C<Grammars>|/type/Grammar> are the natural habitat of subrules, but many common
predefined character classes are also implemented as named regexes.
Whitespace in literal regexes is ignored unless the
L<C<:sigspace> adverb|/language/regexes#Sigspace> is used to make whitespace
=head2 Regex readability: whitespace and comments
Whitespace in regexes is ignored unless the
L<C<:sigspace>|/language/regexes#Sigspace> adverb is used to make whitespace
syntactically significant.
In addition to whitespace, comments may be used inside of regexes to improve
their readability and comprehensibility just as in Perl 6 code in general. This
is true for both L<single line comments|/language/syntax#Single-line_comments>
and L<multi line/embedded comments|
/language/syntax#Multi-line_/_embedded_comments>:
their comprehensibility just as in code in general. This is true for both
L<single line comments|/language/syntax#Single-line_comments> and
L<multi line/embedded comments|/language/syntax#Multi-line_/_embedded_comments>:
my $regex = rx/ \d ** 4 #`(match the year YYYY)
'-'
@@ -61,6 +168,97 @@ and L<multi line/embedded comments|
say '2015-12-25'.match($regex); # OUTPUT: «「2015-12-25」␤»
=head2 Match syntax
There are a variety of ways to match a string against a regex. Irrespective of
the syntax chosen, a successful match results in a L<C<Match>|/type/Match>
object. In case the match is unsuccessful, the result is L<C<Nil>|/type/Nil>. In
either case, the result of the match operation is available via the special
match variable L<C<$/>|/syntax/$$SOLIDUS>.
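A short sketch of both outcomes, using illustrative sample strings:

```raku
# Successful match: the result is a Match object, also available as $/.
my $hit = "commit d10938e" ~~ / \d+ /;
say $hit.^name;    # OUTPUT: «Match␤»
say ~$/;           # OUTPUT: «10938␤»

# Unsuccessful match: neither the result nor $/ is defined.
say ("no digits here" ~~ / \d+ /).defined;   # OUTPUT: «False␤»
say $/.defined;                              # OUTPUT: «False␤»
```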
The most common ways to match a string against an anonymous regex C</pattern/> or
against a named regex C<R> include the following:
=begin item
I«Smartmatch: "string" ~~ /pattern/, or "string" ~~ /<R>/»
L<Smartmatching|/language/operators#index-entry-smartmatch_operator> a string
against a C<Regex> performs a regex match of the string against the C<Regex>:
say "Go ahead, make my day." ~~ / \w+ /; # OUTPUT: «「Go」␤»
my regex R { me|you };
say "You talkin' to me?" ~~ / <R> /; # OUTPUT: «「me」␤ R => 「me」␤»
say "May the force be with you." ~~ &R; # OUTPUT: «「you」␤»
The different outputs of the last two statements show that these two ways of
smartmatching against a named regex are not identical. The difference arises
because the method call C«<R>» from within the anonymous regex C</ /> installs
a so-called L<'named capture'|/language/regexes#Named_captures> in the C<Match>
object, while the smartmatch against the named C<Regex> as such does not.
=end item
=begin item
I«Explicit topic match: m/pattern/, or m/<R>/»
The match operator C<m/ /> immediately matches the topic variable
L<C<$_>|/language/variables#index-entry-topic_variable> against the regex
following the C<m>.
As with the C<rx/ /> syntax for regex definitions, the match operator may be
used with adverbs in between C<m> and the opening regex delimiter, and with
delimiters other than the slash. However, while the C<rx/ /> syntax may only be
used with L<I<regex adverbs>|/language/regexes#Regex_adverbs> that affect the
compilation of the regex, the C<m/ /> syntax may additionally be used with
L<I<matching adverbs>|/language/regexes#Matching_adverbs> that determine how the
regex engine is to perform pattern matching.
Here's an example that illustrates the primary difference between the C<m/ />
and C</ /> syntax:
my $match;
$_ = "abc";
$match = m/.+/; say $match; say $match.^name; # OUTPUT: «「abc」␤Match␤»
$match = /.+/; say $match; say $match.^name; # OUTPUT: «/.+/␤Regex␤»
=end item
=begin item
I«Implicit topic match in sink and boolean contexts»
In case a C<Regex> object is used in sink context, or in a context in which it
is coerced to L<C<Bool>|/type/Bool>, the topic variable
L<C<$_>|/language/variables#index-entry-topic_variable> is automatically matched
against it:
$_ = "dummy string"; # Set the topic explicitly
rx/ s.* /; # Regex object in sink context matches automatically
say $/; # OUTPUT: 「string」
say $/ if rx/ d.* /; # Regex object in boolean context matches automatically
# OUTPUT: 「dummy string」
=end item
=begin item
I«Match method: "string".match: /pattern/, or "string".match: /<R>/»
The L<C<match>|/type/Str#method_match> method is analogous to the C<m/ />
operator discussed above. Invoking it on a string, with a C<Regex> as an
argument, matches the string against the C<Regex>.
=end item
=begin item
I«Parsing grammars: grammar-name.parse($string)»
Although parsing a L<Grammar|/language/grammars> involves more than just
matching a string against a regex, this powerful regex-based text destructuring
tool can't be left out of this overview of common pattern matching methods.
If you feel that your needs exceed what simple regexes have to offer, check out this
L<grammar tutorial|/language/grammar_tutorial> to take regexes to the next level.
=end item
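The C<match> method and grammar parsing can be sketched as follows. The grammar name C<ISODate> and its tokens are hypothetical and serve only as an illustration; C<:g> is the C<:global> matching adverb:

```raku
# Match method: returns a Match object for the first match.
my $m = "Tue 2019-08-21".match(/ \d ** 4 /);
say ~$m;        # OUTPUT: «2019␤»
say $m.from;    # OUTPUT: «4␤»

# With the :g matching adverb, all non-overlapping matches are returned.
my @all = "a1b22c333".match(/ \d+ /, :g);
say @all.map(~*);    # OUTPUT: «(1 22 333)␤»

# A tiny grammar: .parse must match the whole string, starting at TOP.
grammar ISODate {
    token TOP   { <year> '-' <month> '-' <day> }
    token year  { \d ** 4 }
    token month { \d ** 2 }
    token day   { \d ** 2 }
}
my $parsed = ISODate.parse("2019-08-21");
say ~$parsed<month>;    # OUTPUT: «08␤»
```

Each token of the grammar is itself a named regex, so C<$parsed> exposes the parts of the date as named captures.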
=head1 Literals and metacharacters
A regex describes a pattern to be matched in terms of literals and