feature enhanced_re_xx
khwilliamson committed Oct 22, 2022
1 parent 0e7c154 commit f577889
Showing 1 changed file with 88 additions and 0 deletions.
88 changes: 88 additions & 0 deletions rfcs/rfc0018.md
@@ -0,0 +1,88 @@
# RFC - enhanced regex /xx

## Preamble

Author: Karl Williamson <khw@cpan.org>
ID: KHW-0001
Status: Draft

## Abstract

Let programmers improve the readability of regular expression patterns beyond
what is possible now.

## Motivation

Regular expression patterns were designed for concision rather than clarity.

The /x regular expression pattern modifier was created to allow comments and
white space in patterns, making them more readable. It has two shortcomings:
it does not work inside bracketed character classes, and a pattern silently
compiles to something unintended when the programmer forgets to escape literal
white space or, much worse, a literal '#'. An unescaped '#' silently swallows
the rest of the line, which was supposed to be part of the pattern.

I eventually added /xx to at least allow tabs and blanks inside bracketed
character classes. This allows a very minor improvement in their readability.
I could not figure out a way to extend this to allow comments and multiple
lines inside such a class without making it even more likely that the pattern
would silently compile to something unintended. But now, I think this RFC
fixes that.
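
The current behavior described above can be observed with a short script. This
is only an illustrative sketch: the variable names and the test strings are
made up for the demonstration, and it needs Perl 5.26 or later for /xx.

```perl
use v5.26;    # /xx is available from Perl 5.26 on
use warnings;

# Escaped '#': matches "id", a literal '#', then digits.
my $escaped = qr/ \A id \# \d+ \z /x;

# Forgotten escape: the rest of the line after '#' becomes a comment, so
# this compiles to the equivalent of qr/\Aid/ without any complaint.
my $unescaped = qr/ \A id # \d+ \z
                  /x;

# A blank inside [...] is a class member under /x but is ignored under /xx.
my $class_x  = qr/ \A [ a-c ] \z /x;
my $class_xx = qr/ \A [ a-c ] \z /xx;

# Without the proposed feature, /xx still treats a newline and a '#'
# inside [...] as literal members of the class.
my $multiline = qr/ \A [ a-c
                         # ] \z /xx;

sub yn { my ($string, $re) = @_; return $string =~ $re ? "yes" : "no" }

say "escaped matches 'id#42':    ", yn("id#42", $escaped);      # yes
say "escaped matches 'idle':     ", yn("idle",  $escaped);      # no
say "unescaped matches 'idle':   ", yn("idle",  $unescaped);    # yes
say "/x class matches a blank:   ", yn(" ",     $class_x);      # yes
say "/xx class matches a blank:  ", yn(" ",     $class_xx);     # no
say "/xx class matches newline:  ", yn("\n",    $multiline);    # yes
say "/xx class matches '#':      ", yn("#",     $multiline);    # yes
```

The third line of output is the silent failure in question: the pattern that
was meant to require '#' and digits accepts "idle".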

## Specification

I propose adding a new opt-in feature. Call it, for now, "feature
enhanced_re_xx". Within its scope, the /xx modifier would change so that,
inside a bracketed character class [...], any vertical space is treated like a
blank (that is, ignored) and any unescaped '#' begins a comment that runs to
the end of the line.

This would change the existing /x behavior, in which the portion of the line
after the '#' is still parsed to look for a potential pattern-terminating
delimiter. Under this feature, a pattern should be terminated before any '#'
on a line. If an unescaped terminating delimiter is found after a '#' on a
line, a warning would be raised.
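
As a rough illustration of the delimiter rule, the following sketch scans the
raw source text of a pattern line by line. It is not how the Perl tokenizer
works; `delimiter_after_hash` is a hypothetical helper, the warning text is
invented, and the sketch assumes a single-character delimiter and that a
backslash escapes the character after it.

```perl
use strict;
use warnings;

# Hypothetical helper: report each line on which an unescaped delimiter
# appears after the first unescaped '#'.
sub delimiter_after_hash {
    my ($text, $delim) = @_;
    my $d = quotemeta $delim;
    my @warnings;
    my $n = 0;
    for my $line (split /\n/, $text) {
        $n++;
        # Skip escaped pairs and non-'#' characters up to the first
        # unescaped '#'; $1 is the comment text that follows it.
        next unless $line =~ /\A(?:\\.|[^#])*+\#(.*)\z/s;
        my $comment = $1;
        # Look for an unescaped delimiter inside the comment.
        push @warnings, "line $n: unescaped '${delim}' after '#'"
            if $comment =~ m{\A(?:\\.|(?!$d).)*+$d}s;
    }
    return @warnings;
}

my @found = delimiter_after_hash(<<'END', '/');
 \d+    # integer part
 \. \d+ # fraction, as in 1/2
END
print "$_\n" for @found;    # prints: line 2: unescaped '/' after '#'
```

Writing the second comment as "as in 1\/2" would silence the report in this
sketch, since the scan skips escaped characters.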

An unescaped '#' within a comment would also raise a warning. So,

    $a[$i] =~ qr/ [ a-z         # We need to match the lowercase alphabetics
                    ! @ # . *   # And certain punctuation
                    0-9         # And the digits (which can only occur in $a[0])
                  ]
                /xx;

would warn.

It might also be worth warning about any unescaped '#' that isn't of the form
\s+#\s+, to catch cases such as the second line of the above example being
written as just

    !@#.*
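
One way to picture this check is the sketch below. It applies the \s+#\s+ form
literally to the first unescaped '#' on each line, so a '#' at the very start
or end of a line is also reported; the RFC does not settle those cases, and
that choice is an assumption of the sketch. `crowded_hash_warnings` and its
message are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical check: report a comment-starting '#' that does not have
# white space on both sides.
sub crowded_hash_warnings {
    my ($text) = @_;
    my @warnings;
    my $n = 0;
    for my $line (split /\n/, $text) {
        $n++;
        # $1 is everything before the first unescaped '#', $2 what follows.
        next unless $line =~ /\A((?:\\.|[^#])*+)\#(.*)\z/s;
        my ($before, $after) = ($1, $2);
        push @warnings, "line $n: '#' is not set off by white space"
            unless $before =~ /\s\z/ && $after =~ /\A\s/;
    }
    return @warnings;
}

my @crowded = crowded_hash_warnings(<<'END');
 [ a-z   # lowercase
   !@#.*
   ! @ # . *
 ]
END
print "$_\n" for @crowded;    # prints: line 2: '#' is not set off by white space
```

The third line is not reported by this check because its '#' is surrounded by
blanks, even though everything after it would be taken as a comment.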

Also, any comment inside [...] would be checked for an unescaped ']' on the
same line after the '#', and a warning raised if one is found. So, something
like

    $a[$i] =~ qr/ [ a-z # . * ]
                  [ A-Z ]
                /xx;

would warn. To suppress the warning, escape either the '#' or the ']',
depending on what your intent was.
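
The two comment checks can be sketched together as follows. This is a
simplified model, not the regex compiler's logic: `comment_warnings` is a
hypothetical helper, the messages are invented, and class tracking only looks
at unescaped '[' and ']' outside comments (POSIX classes such as [:alpha:] and
a literal ']' placed first in a class are not handled).

```perl
use strict;
use warnings;

# Hypothetical helper: report comments that contain an unescaped '#', and
# comments started inside [...] that contain an unescaped ']'.
sub comment_warnings {
    my ($text) = @_;
    my @warnings;
    my $in_class = 0;    # carried across lines
    my $n = 0;
    for my $line (split /\n/, $text) {
        $n++;
        # Split the line at its first unescaped '#', if there is one.
        my ($code, $comment) =
            $line =~ /\A((?:\\.|[^#])*+)(?:\#(.*))?\z/s;

        # Track whether the non-comment part leaves us inside a class.
        while ($code =~ /\\.|([\[\]])/gs) {
            next unless defined $1;    # an escaped character
            $in_class = $1 eq '[' ? 1 : 0;
        }
        next unless defined $comment;

        push @warnings, "line $n: unescaped '#' inside a comment"
            if $comment =~ /\A(?:\\.|[^#])*+\#/s;
        push @warnings, "line $n: unescaped ']' inside a comment within [...]"
            if $in_class && $comment =~ /\A(?:\\.|[^\]])*+\]/s;
    }
    return @warnings;
}

my $first = <<'END';
 [ a-z # We need to match the lowercase alphabetics
 ! @ # . * # And certain punctuation
 0-9 # And the digits (which can only occur in $a[0])
 ]
END
my $second = <<'END';
 [ a-z # . * ]
 [ A-Z ]
END
print "first:  $_\n" for comment_warnings($first);
print "second: $_\n" for comment_warnings($second);
# first:  line 2: unescaped '#' inside a comment
# first:  line 3: unescaped ']' inside a comment within [...]
# second: line 1: unescaped ']' inside a comment within [...]
```

Under this model the first example is reported twice: once for the second '#'
on its second line, and once because the comment on its third line contains
the ']' of $a[0].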

I think these checks would catch essentially all cases where a '#' was meant
to be taken literally but is unintentionally treated as the start of a comment.

I can't think of anything to catch blanks/tabs being unintentionally ignored.

I also propose that unescaped '#' and vertical space inside bracketed character
classes under /xx be deprecated. /xx has been available only since 5.26, so
there is not a huge amount of code that uses it. After the deprecation cycle,
the feature could become automatic, not opt-in, and /xx would have the new
meaning.

Note there is no change to plain /x.

Copyright (C) 2022 Karl Williamson

This document and code and documentation within it may be used, redistributed
and/or modified under the same terms as Perl itself.
