=begin pod
=TITLE Grammars
=SUBTITLE Parsing and interpreting text
Grammars are a powerful tool used to destructure text and often to
return data structures that have been created by interpreting that text.
For example, Perl 6 is parsed and executed using a Perl 6-style grammar.
An example that's more practical to the common Perl 6 user is the
L<JSON::Tiny module|https://github.com/moritz/json>, which can
deserialize any valid JSON file, yet its deserializing code is
written in fewer than 100 lines of simple, extensible code.
If you didn't like grammar in school, don't let that scare you off grammars.
Grammars allow you to group regexes, just as classes allow you to group
methods of regular code.
=head1 X<Named Regexes|declarator,regex;declarator,token;declarator,rule>
The main ingredient of grammars is named L<regexes|/language/regexes>. While
the syntax of L<Perl 6 Regexes|/language/regexes> is outside the scope of this
document, I<named> regexes have a special syntax, similar to subroutine
definitions:N<In fact, named regexes can even take extra arguments, using the
same syntax as subroutine parameter lists>
=begin code :allow<B>
my B<regex number {> \d+ [ \. \d+ ]? B<}>
=end code
In this case, we have to specify that the regex is lexically scoped
using the C<my> keyword, because named regexes are normally used within
grammars.
Being named gives us the advantage of being able to easily reuse the
regex elsewhere:
=begin code :allow<B>
say "32.51" ~~ B<&number>;
say "15 + 4.5" ~~ /B<< <number> >>\s* '+' \s*B<< <number> >>/
=end code
B<C<regex>> isn't the only declarator for named regexes -- in fact, it's the
least common. Most of the time, the B<C<token>> or B<C<rule>> declarators are
used. These are both I<ratcheting>, which means that the match engine won't
back up and try again if it fails to match something. This usually does
what you want, but it isn't appropriate for all cases:
=begin code :allow<B>
my regex works-but-slow { .+ q }
my token fails-but-fast { .+ q }
my $s = "Tokens won't backtrack, which makes them fail quicker!";
say so $s ~~ &works-but-slow; # True
say so $s ~~ &fails-but-fast; # False, the entire string gets taken by the .+
=end code
The only difference between the C<token> and C<rule> declarators is that the
C<rule> declarator causes L<C<:sigspace>|/language/regexes#Sigspace> to go into
effect for the Regex:
=begin code :allow<B>
my token non-space-y { once upon a time }
my rule space-y { once upon a time }
say 'onceuponatime' ~~ &non-space-y;
say 'once upon a time' ~~ &space-y;
=end code
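One way to picture what C<:sigspace> does is to write the implicit whitespace
matches out by hand. The following is a minimal sketch, assuming the default
C<ws> rule; the names C<implicit-ws> and C<explicit-ws> are made up for this
illustration:
=begin code
# In a rule, each run of whitespace after an atom acts like a call to <.ws>
my rule  implicit-ws { once upon a time }
# Roughly the same pattern as a token, with the whitespace matches spelled out
my token explicit-ws { once <.ws> upon <.ws> a <.ws> time <.ws> }
say so 'once upon a time' ~~ &implicit-ws; # True
say so 'once upon a time' ~~ &explicit-ws; # True
say so 'onceuponatime'    ~~ &implicit-ws; # False, <.ws> needs whitespace between word characters
=end code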
=head1 X<Creating Grammars|class,Grammar;declarator,grammar>
=SUBTITLE Group of named regexes that form a formal grammar
    class Grammar is Cursor { }
C<Grammar> is the superclass that classes automatically get when they
are declared with the C<grammar> keyword instead of C<class>. Grammars
should only be used to parse text; if you wish to extract complex data,
we recommend using an L<action class|/language/grammars#Action_Classes>
in conjunction with the grammar.
=begin code :allow<B L>
B<grammar> CSV {
    token TOP { [ <line> \n? ]+ }
    token line {
        ^^            # Beginning of a line
        <value>* % \, # Any number of <value>s with commas in between them
        $$            # End of a line
    }
    token value {
        [
        | <-[",\n]>     # Anything not a double quote, comma or newline
        | <quoted-text> # Or some quoted text
        ]*              # Any number of times
    }
    token quoted-text {
        \"
        [
        | <-["\\]> # Anything not a " or \
        | '\"'     # Or \", an escaped quotation mark
        ]*         # Any number of times
        \"
    }
}
say "Valid CSV file!" if CSV.L<parse>( q:to/EOCSV/ );
    Year,Make,Model,Length
    1997,Ford,E350,2.34
    2000,Mercury,Cougar,2.38
    EOCSV
=end code
=head2 Methods
=head3 method parse
    method parse($str, :$rule = 'TOP', :$actions) returns Match:D
Matches the grammar against C<$str>, using C<$rule> as the starting rule,
optionally applying C<$actions> as its actions object.
This will fail if the grammar does not parse the I<entire> string. If a
parse of only a part of the string is desired, use L<subparse>.
The method returns the resulting L<Match> object and also sets the caller's C<$/>
variable to the Match object.
=begin code :allow<B>
say CSVB<.parse>( q:to/EOCSV/ );
    Year,Make,Model,Length
    1997,Ford,E350,2.34
    2000,Mercury,Cougar,2.38
    EOCSV
=end code
This outputs:
    「Year,Make,Model,Length
    1997,Ford,E350,2.34
    2000,Mercury,Cougar,2.38
    」
     line => 「Year,Make,Model,Length」
      value => 「Year」
      value => 「Make」
      value => 「Model」
      value => 「Length」
     line => 「1997,Ford,E350,2.34」
      value => 「1997」
      value => 「Ford」
      value => 「E350」
      value => 「2.34」
     line => 「2000,Mercury,Cougar,2.38 」
      value => 「2000」
      value => 「Mercury」
      value => 「Cougar」
      value => 「2.38 」
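The C<:rule> and C<:actions> arguments can be tried out on a much smaller
grammar. This is a sketch only; the C<Sum> grammar and the C<SumActions> class
are hypothetical names invented for this example and are not part of any
library:
=begin code
grammar Sum {
    token TOP { <num> '+' <num> }
    token num { \d+ }
}
# Hypothetical actions class: each method is named after the rule it accompanies
class SumActions {
    method TOP($/) { make $<num>[0] + $<num>[1] }
}
say so Sum.parse('2+3');                         # True, TOP matches the whole string
say so Sum.parse('42');                          # False, TOP requires a '+'
say ~Sum.parse('42', :rule<num>);                # 42, parsing starts at num instead of TOP
say Sum.parse('2+3', :actions(SumActions)).made; # 5
=end code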
=head3 method subparse
    method subparse($str, :$rule = 'TOP', :$actions) returns Match:D
Matches the grammar against C<$str>, using C<$rule> as the starting rule,
optionally applying C<$actions> as its actions object.
Unlike L<parse>, C<subparse> will allow the grammar to match only part
of the supplied string.
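The difference between the two methods can be seen on a string that the
grammar only partly covers. A minimal sketch, using a hypothetical one-rule
grammar named C<Digits>:
=begin code
grammar Digits {
    token TOP { \d+ }
}
say so Digits.parse('123abc');    # False, "abc" is left over
say ~Digits.subparse('123abc');   # 123, the match ends where TOP ends
say so Digits.subparse('abc123'); # False, the match must still begin at the start of the string
=end code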
=head3 method parsefile
    method parsefile(Cool $filename as Str, *%opts) returns Match:D
Parses the contents of the file C<$filename> with the L<parse> method,
passing any named options in C<%opts>.
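As a rough illustration, the sketch below writes a scratch file and parses it.
The C<Numeral> grammar and the file name C<grammar-demo.txt> are assumptions
made for this example:
=begin code
grammar Numeral {
    token TOP { \d+ }
}
# Hypothetical scratch file in the system's temporary directory
my $path = $*TMPDIR.child('grammar-demo.txt');
spurt $path, '12345';
say ~Numeral.parsefile($path.Str); # 12345
unlink $path;                      # Clean up the scratch file
=end code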
=head1 Action Classes
TODO
=end pod