Merge pull request #2431 from chsanch/fix-2410
Split regexes page and add regexes best practices page
Showing 2 changed files with 207 additions and 200 deletions.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,204 @@ | ||
| =begin pod :tag<perl6> | ||
| =TITLE Regexes: Best Practices and gotchas | ||
| =SUBTITLE Some tips on regexes and grammars | ||
| To help with robust regexes and grammars, here are some best practices | ||
| for code layout and readability, what to actually match, and avoiding common | ||
| pitfalls. | ||
| =head1 Code layout | ||
| Without the C<:sigspace> adverb, whitespace is not significant in Perl 6 | ||
| regexes. Use that to your own advantage and insert whitespace where it | ||
| increases readability. Also, insert comments where necessary. | ||
| Compare the very compact | ||
| my regex float { <[+-]>?\d*'.'\d+[e<[+-]>?\d+]? } | ||
| to the more readable | ||
| my regex float { | ||
| <[+-]>? # optional sign | ||
| \d* # leading digits, optional | ||
| '.' | ||
| \d+ | ||
| [ # optional exponent | ||
| e <[+-]>? \d+ | ||
| ]? | ||
| } | ||
| As a rule of thumb, use whitespace around atoms and inside groups; put | ||
| quantifiers directly after the atom; and vertically align opening and closing | ||
| square brackets and parentheses. | ||
| When you use a list of alternations inside parentheses or square brackets, align | ||
| the vertical bars: | ||
| my regex example { | ||
| <preamble> | ||
| [ | ||
| || <choice_1> | ||
| || <choice_2> | ||
| || <choice_3> | ||
| ]+ | ||
| <postamble> | ||
| } | ||
| =head1 Keep it small | ||
| Regexes are often more compact than regular code. Because they do so much with | ||
| so little, keep regexes short. | ||
| When you can name a part of a regex, it's usually best to | ||
| put it into a separate, named regex. | ||
| For example, you could take the float regex from earlier: | ||
| my regex float { | ||
| <[+-]>? # optional sign | ||
| \d* # leading digits, optional | ||
| '.' | ||
| \d+ | ||
| [ # optional exponent | ||
| e <[+-]>? \d+ | ||
| ]? | ||
| } | ||
| And decompose it into parts: | ||
| my token sign { <[+-]> } | ||
| my token decimal { \d+ } | ||
| my token exponent { 'e' <sign>? <decimal> } | ||
| my regex float { | ||
| <sign>? | ||
| <decimal>? | ||
| '.' | ||
| <decimal> | ||
| <exponent>? | ||
| } | ||
| That helps, especially when the regex becomes more complicated. For example, | ||
| you might want to make the decimal point optional in the presence of an exponent. | ||
| my regex float { | ||
| <sign>? | ||
| [ | ||
| || <decimal>? '.' <decimal> <exponent>? | ||
| || <decimal> <exponent> | ||
| ] | ||
| } | ||
| =head1 What to match | ||
| Often the input data format has no clear-cut specification, or the | ||
| specification is not known to the programmer. Then, it's good to be liberal | ||
| in what you expect, but only so long as there are no possible ambiguities. | ||
| For example, in C<ini> files: | ||
| =begin code :skip-test | ||
| [section] | ||
| key=value | ||
| =end code | ||
| What can be inside the section header? Allowing only a word might be too | ||
| restrictive. Somebody might write C<[two words]>, or use dashes, etc. | ||
| Instead of asking what's allowed on the inside, it might be worth asking | ||
| instead: I<what's not allowed?> | ||
| Clearly, closing square brackets are not allowed, because C<[a]b]> would be | ||
| ambiguous. By the same argument, opening square brackets should be forbidden. | ||
| This leaves us with | ||
| token header { '[' <-[ \[\] ]>+ ']' } | ||
| which is fine if you are only processing one line. But if you're processing | |
| a whole file, the regex also accepts | |
| =begin code :lang<text> | ||
| [with a | ||
| newline in between] | ||
| =end code | ||
| which might not be a good idea. A compromise would be | ||
| token header { '[' <-[ \[\] \n ]>+ ']' } | ||
| and then, in the post-processing, strip leading and trailing spaces and tabs | ||
| from the section header. | ||
| =head1 Matching whitespace | ||
| The C<:sigspace> adverb (or using the C<rule> declarator instead of C<token> | ||
| or C<regex>) is very handy for implicitly parsing whitespace that can appear | ||
| in many places. | ||
| Going back to the example of parsing C<ini> files, we have | ||
| my regex kvpair { \s* <key=identifier> '=' <value=identifier> \n+ } | ||
| which is probably not as liberal as we want it to be, since the user might | ||
| put spaces around the equals sign. So, then we may try this: | ||
| my regex kvpair { \s* <key=identifier> \s* '=' \s* <value=identifier> \n+ } | ||
| But that's looking unwieldy, so we try something else: | ||
| my rule kvpair { <key=identifier> '=' <value=identifier> \n+ } | ||
| But wait! The implicit whitespace matching after the value uses up all | ||
| whitespace, including newline characters, so the C<\n+> doesn't have | ||
| anything left to match (and C<rule> also disables backtracking, so no luck | ||
| there). | ||
| Therefore, it's important to restrict implicit whitespace to whitespace | |
| that is not significant in the input format. | |
| This works by redefining the token C<ws>; however, it only works for | ||
| L<grammars|/language/grammars>: | ||
| grammar IniFormat { | ||
| token ws { <!ww> \h* } | ||
| rule header { \s* '[' (\w+) ']' \n+ } | ||
| token identifier { \w+ } | ||
| rule kvpair { \s* <key=identifier> '=' <value=identifier> \n+ } | ||
| token section { | ||
| <header> | ||
| <kvpair>* | ||
| } | ||
| token TOP { | ||
| <section>* | ||
| } | ||
| } | ||
| my $contents = q:to/EOI/; | ||
| [passwords] | ||
| jack = password1 | ||
| joy = muchmoresecure123 | ||
| [quotas] | ||
| jack = 123 | ||
| joy = 42 | ||
| EOI | ||
| say so IniFormat.parse($contents); | ||
| Besides putting all regexes into a grammar and turning them into tokens | ||
| (because they don't need to backtrack anyway), the interesting new bit is | ||
| token ws { <!ww> \h* } | ||
| which gets called for implicit whitespace parsing. It succeeds only when the | |
| current position is not between two word characters (C<< <!ww> >>, the negated | |
| "within word" assertion), and then consumes zero or more horizontal space characters. The limitation to horizontal | |
| whitespace is important, because newlines (which are vertical whitespace) | ||
| delimit records and shouldn't be matched implicitly. | ||
| Still, there's some whitespace-related trouble lurking. The regex C<\n+> | ||
| won't match a string like C<"\n \n">, because there's a blank between the | ||
| two newlines. To allow such input strings, replace C<\n+> with C<\n\s*>. | ||
| =end pod | ||
| # vim: expandtab softtabstop=4 shiftwidth=4 ft=perl6 |
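
The decomposed `float` regex from the "Keep it small" section can be tried out with a short script. This is only a sketch: the regexes are copied from the pod above, while the input strings and the `^ ... $` anchoring are assumptions added for illustration.

```raku
# Sketch: the decomposed float regex from the pod above, tried on a few
# hypothetical inputs. The test strings are illustrative assumptions.
my token sign     { <[+-]> }
my token decimal  { \d+ }
my token exponent { 'e' <sign>? <decimal> }
my regex float {
    <sign>?
    [
    || <decimal>? '.' <decimal> <exponent>?
    || <decimal> <exponent>
    ]
}

for '1.5', '-.5', '3e10', '.5e-3', '42' -> $s {
    # Anchor with ^ and $ so the whole string has to be a float.
    my $ok = so $s ~~ /^ <float> $/;
    say "$s: $ok";
}
```

The first four strings should report `True`. The plain integer `42` should report `False`, because neither alternative accepts a number with no decimal point and no exponent.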