define extensible boundary syntax
The Unicode folks seem to want an extensible boundary syntax with \b,
but we've abandoned \b for boundary, so it's now <|x> for various
values of x.  (And <!|x> is the negation, so no need for <|X>.)
<?wb> is now <|w>.
TimToady committed Feb 9, 2011
1 parent 4ec52e3 commit 81058c1
Showing 1 changed file with 10 additions and 2 deletions.
12 changes: 10 additions & 2 deletions S05-regex.pod
@@ -1775,6 +1775,14 @@ Note that a consequence of the previous section is that you also get

for free, which fails if the current rule would match again at this location.

+=item *
+
+A leading C<|> indicates some kind of a zero-width boundary.
+
+    <|w>        word boundary
+    <|g>        grapheme boundary (always matches in grapheme mode)
+    <|c>        codepoint boundary (always matches in grapheme/codepoint mode)
+
=back

The following tokens include angles but are not required to balance:
@@ -1809,8 +1817,8 @@ These tokens are considered declarative, but may force backtracking behavior.

A C<«> or C<<< << >>> token indicates a left word boundary. A C<»> or
C<<< >> >>> token indicates a right word boundary. (As separate tokens,
-these need not be balanced.) Perl 5's C<\b> is replaced by a C<< <?wb> >>
-"word boundary" assertion, while C<\B> becomes C<< <!wb> >>. (None of
+these need not be balanced.) Perl 5's C<\b> is replaced by a C<< <|w> >>
+"word boundary" assertion, while C<\B> becomes C<< <!|w> >>. (None of
these are dependent on the definition of C<< <.ws> >>, but only on the C<\w>
definition of "word" characters. Non-space mark characters are ignored in
calculating word properties of the preceding character. See TR18 1.4.)
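
To make the mapping concrete, here is a minimal sketch of rough Python `re` analogues for the assertions named in this hunk. It is an analogy only, not part of the spec or the commit. The constant names and the `boundary_positions` helper are hypothetical, and Python's `\b` does not apply the TR18 rule about ignoring non-space marks, so results can differ on text with combining characters.

```python
import re

# Assumed correspondences (illustrative, not taken from S05):
WORD_BOUNDARY = r"\b"                # roughly <|w>  (Perl 5 \b)
NOT_WORD_BOUNDARY = r"\B"            # roughly <!|w> (Perl 5 \B)
LEFT_WORD_BOUNDARY = r"\b(?=\w)"     # roughly << or the left guillemet token
RIGHT_WORD_BOUNDARY = r"\b(?<=\w)"   # roughly >> or the right guillemet token


def boundary_positions(pattern, text):
    """Return every offset in `text` where the zero-width `pattern` matches."""
    return [m.start() for m in re.finditer(pattern, text)]


if __name__ == "__main__":
    sample = "ab cd"
    for name, pattern in [
        ("word boundary", WORD_BOUNDARY),
        ("not a word boundary", NOT_WORD_BOUNDARY),
        ("left word boundary", LEFT_WORD_BOUNDARY),
        ("right word boundary", RIGHT_WORD_BOUNDARY),
    ]:
        print(name, boundary_positions(pattern, sample))
```

Every offset in a string is either a word boundary or not one, which is why the negated form `<!|x>` suffices and no separate `<|X>` is needed.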
