lib/Language/grammars.pod

=begin pod

=TITLE Grammars

=SUBTITLE Parsing and interpreting text

Grammars are a powerful tool used to destructure text and often to
return data structures that have been created by interpreting that text.

For example, Perl 6 is parsed and executed using a Perl 6-style grammar.

An example that's more practical to the common Perl 6 user is the
L<JSON::Simple module|https://github.com/moritz/json>, which can
deserialize any valid JSON file, however the deserializing code is
written in less than 100 lines of simple, extensible code.

If you didn't like grammar in school, don't let that scare you off grammars.
Grammars allow you to group regexes, just as classes allow you to group
methods of regular code.

=head1 X<Named Regexes|declarator,regex;declarator,token;declarator,rule>

The main ingredient of grammars is named L<regexes|/language/regexes>.  While
the syntax of L<Perl 6 Regexes|/language/regexes> is outside the scope of this
document, I<named> regexes have a special syntax, similar to subroutine
definitions:N<In fact, named regexes can even take extra arguments, using the
same syntax as subroutine parameter lists>

    =begin code :allow<B>
    my B<regex number {> \d+ [ \. \d+ ]? B<}>
    =end code

In this case, we have to specify that the regex is lexically scoped
using the C<my> keyword, because named regexes are normally used within
grammars.

Being named gives us the advantage of being able to easily reuse the
regex elsewhere:

    =begin code :allow<B>
    say "32.51" ~~ B<&number>;
    say "15 + 4.5" ~~ /B<< <number> >>\s* '+' \s*B<< <number> >>/
    =end code

B<C<regex>> isn't the only declarator for named regexes -- in fact, it's the
least common. Most of the time, the B<C<token>> or B<C<rule>> declarators are
used. These are both I<ratcheting>, which means that the match engine won't
back up and try again if it fails to match something. This will usually do what you want, but isn't appropriate for all cases:

    =begin code :allow<B>
    my regex works-but-slow { .+ q }
    my token fails-but-fast { .+ q }
    my $s = 'Tokens won't backtrack, which makes them fail quicker!';
    say so $s ~~ &works-but-slow; # True
    say so $s ~~ &fails-but-fast; # False, the entire string get taken by the .+
    =end code

The only difference between the C<token> and C<rule> declarators is that the
C<rule> declarator causes L<C<:sigspace>|/language/regexes#Sigspace> to go into
effect for the Regex:

    =begin code :allow<B>
    my token non-space-y { once upon a time }
    my rule space-y { once upon a time }
    say 'onceuponatime'    ~~ &non-space-y;
    say 'once upon a time' ~~ &space-y;
    =end code

=head1 X<Creating Grammars|class,Grammar;declarator,grammar>

=SUBTITLE Group of named regexes that form a formal grammar

    class Grammar is Cursor { }

C<Grammar> is the superclass that classes automatically get when they
are declared with the C<grammar> keyword instead of C<class>. Grammars
should only be used to parse text; if you wish to extract complex data,
an L<action class|/language/grammars#Action_Classes> is recommended to
be used in conjunction with the grammar.

    =begin code :allow<B L>
    B<grammar> CSV {
        token TOP { [ <line> \n? ]+ }
        token line {
            ^^            # Beginning of a line
            <value>* % \, # Any number of <value>s with commas in between them
            $$            # End of a line
        }
        token value {
            [
            | <-[",\n]>     # Anything not a double quote, comma or newline
            | <quoted-text> # Or some quoted text
            ]*              # Any number of times
        }
        token quoted-text {
            \"
            [
            | <-["\\]> # Anything not a " or \
            | '\"'     # Or \", an escaped quotation mark
            ]*         # Any number of times
            \"
        }
    }

    say "Valid CSV file!" if CSV.L<parse>( q:to/EOCSV/ );
        Year,Make,Model,Length
        1997,Ford,E350,2.34
        2000,Mercury,Cougar,2.38
        EOCSV
    =end code

=head2 Methods

=head3 method parse

    method parse($str, :$rule = 'TOP', :$actions) returns Match:D

Matches the grammar against C<$str>, using C<$rule> as the starting rule,
optionally applying C<$actions> as its actions object.

This will fail if the grammar does not parse the I<entire> string. If a
parse of only a part of the string is desired, use L<subparse>.

The method returns the resulting L<Match> object and also sets the caller's C<$/>
variable to the Match object.

    =begin code :allow<B>
    say CSVB<.parse>( q:to/EOCSV/ );
        Year,Make,Model,Length
        1997,Ford,E350,2.34
        2000,Mercury,Cougar,2.38
        EOCSV
    =end code

This outputs:

    ｢Year,Make,Model,Length
    1997,Ford,E350,2.34
    2000,Mercury,Cougar,2.38
    ｣
     line => ｢Year,Make,Model,Length｣
      value => ｢Year｣
      value => ｢Make｣
      value => ｢Model｣
      value => ｢Length｣
     line => ｢1997,Ford,E350,2.34｣
      value => ｢1997｣
      value => ｢Ford｣
      value => ｢E350｣
      value => ｢2.34｣
     line => ｢2000,Mercury,Cougar,2.38 ｣
      value => ｢2000｣
      value => ｢Mercury｣
      value => ｢Cougar｣
      value => ｢2.38 ｣

=head3 method subparse

    method subparse($str, :$rule = 'TOP', :$actions) returns Match:D

Matches the grammar against C<$str>, using C<$rule> as the starting rule,
optionally applying C<$actions> as its actions object.

Unlike L<parse>, C<subparse> will allow the grammar to match only part
of the supplied string.

=head3 method parsefile

    method parsefile(Cool $filename as Str, *%opts) returns Match:D

Parses the contents of the file C<$filename> with the L<parse> method,
passing any named options in C<%opts>.

=head1 Action Classes

TODO

=end pod