-
Notifications
You must be signed in to change notification settings - Fork 0
/
2-charclasses
295 lines (229 loc) · 13.5 KB
/
2-charclasses
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
=for history
The Hanging Gardens of Babylon, also known as the Hanging Gardens of Semiramis,
near present-day Al Hillah, Babil in Iraq, are considered to be one of the
original Seven Wonders of the Ancient World. They were built by the Chaldean
king Nebuchadnezzar II around 600 BC. He is reported to have constructed the
gardens to please his homesick wife, Amytis of Media, who longed for the trees
and fragrant plants of her homeland Persia. The gardens were destroyed by
several earthquakes after the 2nd century BC. [...] There is some controversy
as to whether the Hanging Gardens were an actual creation or a poetic creation
due to the lack of documentation of them in the chronicles of Babylonian
history.
=head0 Character classes
=head1 Your robot and you
Picture yourself in the distant future, maybe even on a different planet.
The C<profession> field of your ID holocard says 'cook'; even in the future,
people prefer to combine fine ingredients into splendid taste sensations,
or to employ a person who does it for them. One can't really trust a robot
with the minute culinary decisions involved in preparing an unforgettable meal.
At least, so you keep telling yourself.
You do, however, have the most advanced machinery on the market to help you in
the kitchen. Including, as it happens, a robot. One can't really avoid having
one, the outdoor atmosphere being toxic and all. The robot takes care of doing
tasks that require going outside, such as acquiring ingredients. Ingredients
form a vital part of cooking.
The robot accepts input through a keyboard. (In this futuristic scenario,
researchers gave up on speech recognition, because the state-of-the-art
technology never evolved beyond the ability to distinguish between such words
as "talk", "took", "torque", "Turk" and "toke"; in the end, the researchers
couldn't determine whether they had simply set the bar too high from the start,
or whether the machines just felt like messing with them. So keyboards remained
the dominant technology.) One enters the instructions for which groceries to
pick up through a very specialized "Delivery-Specific Language", one which you
at one point spent a three-day course in chef's school mastering. It looks like
this:
<[ r p q ]>
That one says "pick up B<r>ice or B<p>asta or B<q>uinoa" (either one of them
works fine). The robot goes out and gets one of those for you, whichever
it finds first.
Now, if it always came down to specifying a few things with that simple syntax,
the robot wouldn't really deserve to be called futuristic. But it also comes
with a few really nice shorthands. For example, you could as well have
written the above instruction as:
<[ p..r ]>
(Since C<p>, C<q> and C<r> follow each other alphabetically.) Or, maybe you
just wanted to buy something in the C<a..m> range, I<except> for fish or
eels, which also happen to be toxic (to humans) on this planet:
<[a..m] - [fe]>
The robot obliges and comes back with any of the groceries in the C<a..m>
range (whichever it finds first at the market), but it makes sure not to pick a
fish or an eel. In fact, if your only concern is not getting the fish or the
eel (as they carefully explained in the three-day course), you can give the
following instruction to the robot:
<-[fe]>
In English, that means "pick up absolutely *any* grocery, except fishes or
eels". Ah, the fun you had with this function in the three-day course!
Conversely, if you want to specify a larger number of inclusions and
exclusions, that's fine too:
<[a..m] - [fe] + [p..r]>
And, in fact, you can even start off with a C<+> if you like the symmetry of
it -- though it doesn't add anything semantically:
<+[a..m] - [fe] + [p..r]>
Thus works your groceries robot. To be honest, you've always felt a little
bit of roboticist in you, and more than once have you wondered at the workings
below that shiny panel, and thought about what structures must exist inside the
mechanical assistant in order for it to do its job: go to the market with your
coded instructions, and arrive back (in close to linear time!) with anything
but fish or eel, or whatever it was your instructions said that particular
time.
=head1 Voiding the warranty
One evening when the kitchen has closed for the day, you fall for the
temptation and open up the robot. You unscrew the front panel with the snazzy
logo "Patikmeesho Groceries Engine" on it, and peer within, into near-total
darkness.
You only find a collection of beads in there, stacked into neat boxes and with
names on them. You see blue beads and red beads and green beads. That certainly
doesn't make you any wiser. Thinking that seeing the mechanism in action might
help, you type in your usual rice-pasta-quinoa order on the keyboard:
<[ r p q ]>
Things start moving in the dark insides of the droid! Sooner than you can
blink, deft mechanical arms have picked up a blue bead, whirled it around the
innards of the machine, and finally hung it on a fine wire from the roof of the
dark space inside. It looks like this:
EnumCharList{ ast => 'rpq' }
Exclaiming "A-ha!" to the empty kitchen, you make a mental note as to the
function of the blue beads. Seconds later, you have to manually overpower the
machine, when it tries to leave the kitchen on its way to the market picking up
either rice, pasta or quinoa. You put it back in its corner and flip a
'dry-run' switch so that the bot won't attempt to leave again.
You try another query:
<[a..m] - [fe]>
Three beads leap into action: two blue ones and one red. When they settle
down, suspended from the roof of the inside of the robot just like the last
time, they hang from each other in a tree-like structure:
Concat
|
+--EnumCharList{ ast => 'fe', :isnegated, :iszerowidth }
|
+--EnumCharList{ ast => 'abcdefghijklm' }
This time, you have to look for quite some time before it all sinks in.
First off, you note that red beads ("Concat") seem to tie together the blue
beads when the order consists of many parts. Also, it doesn't escape your
attention that your order has been... reversed. Hm.
And those new flags, C<:isnegated> and C<:iszerowidth>? You can sort of
grok the first one, since the C<[fe]> part of the order really is negated
(you really I<don't> want the fish or the eel on this planet). The other one,
however...
A flash of vivid imagination strikes you. You picture the robot at the market,
picking up a certain grocery item. It must now run two tests on this item,
the first being "not fish or eel", and the second being "something in the
range C<a..m>". It should do both of those tests on the I<same> item and not,
say, the first test on one item and the second test on some other item. That
could have disastrous and confusing, not to mention erroneous, consequences.
You therefore decide, preliminarily, that C<:iszerowidth> means that it should
hold onto the same grocery item, and not proceed to the next. Perhaps there's a
"standard unit of grocery width" so that the robot knows how far to move to get
to the next grocery, and this flag temporarily sets that width to zero. Yeah,
that makes sense.
The structure of beads now starts to appear as a sort of "operational
procedure" for the robot. Each time it stops before a grocery item, it picks it
up and inspects it according to rules given by the structure of beads. If it
can get through all the beads, the machine knows it found something that it can
take back home. For simple orders, one bead is enough. For the more complex
ones, the robot puts together a whole tree of beads, crystalizing the textual
input into globular chunks of data.
Just for kicks, you decide to try another query. Still haven't managed to
activate those green beads.
<[a..m] + [p..r]>
Hah! This time, one green bead does the dance along with two blue ones.
Alt
|
+--EnumCharList{ ast => 'abcdefghijklm' }
|
+--EnumCharList{ ast => 'pqr' }
This time, things are much simpler. There are no extra flags, the blue beads
weren't reversed, and you seem to grasp the workings immediately. Obviously,
"Alt" means "do either of these B<alt>ernatives". So, no mystery there.
Being a very thorough person, you try a few last variations on the syntax.
<+[a..m]>
That order causes one blue bead to spring into action and settle down.
EnumCharList{ ast => 'abcdefghijklm' }
Simple as that. The C<+> at the beginning doesn't seem to matter much.
What about this one?
<-[fe]>
Several beads this time. One blue, one red, and... hey, where did that black
bead come from? Guess you didn't see it until now, due to the darkness.
Concat
|
+--EnumCharList{ ast => 'fe', :isnegated, :iszerowidth }
|
+--CCShortcut{ ast => '.' }
You stare at the new, black bead. What could it mean? It appears to you as
the final, remaining mystery of the inner workings of the groceries robot.
Then you recall what the order C<< <-[fe]> >> means: "pick up absolutely B<any>
grocery, except fishes or eels". The black bead sits at the position of the
positive check done by the robot, the things it I<should> accept. The black
bead must be the "absolutely B<any>" part of it. That fits.
Totally contented with this foray into the robot, you screw back the panel
with the PGE logo, flip the 'dry-run' switch, and tell the robot to go fetch
some tasty grocery to celebrate. Anything but a fish or an eel.
Life rocks in the future.
=head1 Regex nodetypes, sitting in a tree...
Ok; back to talking about regexes. The above story about a robot picking up
groceries has, of course, nothing to do with regexes... except that its
operation language seems eerily reminiscent of Perl 6 character classes. (As
usual, S05 has the whole story.) And the "inner workings" of the robot have a
lot to do with the inner workings of how the Parrot Grammar Engine converts the
regex into an abstract syntax tree, which it then uses to go shopping for
characters.
We've seen a few new types of regex parts today. The C<EnumCharList> stands
out as the core player. It represents a sort of "character class atom", the
indivisible part with which to build more complicated structures. The
C<EnumCharList> contains a list of characters, and it can be negated and
have zero width. But all that is handled internally, under the regex engine's
hood.
The C<CCShortcut> provides us with a simple way to talk about common character
classes without listing all the constituent characters. We saw the C<.> above,
which stands for "any character". Below you find a complete list of the
character classes in Perl 6 regexes:
. Any character
\w Any letter, digit or underscore
\d Any digit
\s Any whitespace character
\h Any horizontal whitespace character (space, tab, etc)
\v Any vertical whitespace character (return, line feed, etc)
\e The escape character (0x1B)
\f The form feed character (0x0C)
\r The line feed character (0x0D)
\t The tab character (0x09)
Apart from the dot, every shortcut also has an upper-case counterpart, which
means "anything but". Thus, you'd use C<\S> to mean "anything but a whitespace
character", for example.
The C<Concat> is a way to sequence parts of a regex. In the more common use
case it strings together the several parts of a regex into one single abstract
syntax tree. C</ foo bar baz /> would parse down into a C<Concat> of three
literals. C<Concat> then works as the glue that unites the whole expression,
and the grammar engine makes sure to execute the bits in the order given. But
in character classes we sort of hijack the functionality of C<Concat> a bit,
checking the same character twice by switching on the C<:iszerowidth> flag on
an C<EnumCharList>.
The C<Alt> has a similar background. We'll cover its actual use more in the
next installment. In short, it does alternations, making C</ foo | bar /> match
either C<foo> or C<bar>. Natually, it's a perfect stand-in for what the C<+>
means in character classes.
=head1 ...P-A-R-S-I-N-G
Last time, when we dealt with literals and anchors, the match results that
interested us consisted of a 'yes' or a 'no'. We could also have been
interested in the actual position of the match. But it wouldn't have made
much sense to go to that position in the string and see what we got back,
because neither literals or anchors carry around any degrees of freedom.
(Literals "just" represent themselves, i.e. one possible match, and anchors
"just" talk about the gaps between characters and their properties.)
Character classes, by design, allow for degrees of freedom. Just as a
futuristic chef somewhere on another planet might have some slight interest in
what grocery the robot arrived back with, we might have some slight interest
not only in the position of a possible match, but the actual substring that
matched at that position.
Now, one could provide such a substring by doing the actual copying of the
substring in question, allocating new memory for it and so on. I<Or> one
could provide (a reference to) the original string, together with information
saying "our match started I<here> and ended I<here>". Thus removing one string
allocation for every time we find a match.
Whereas in the last installment we might have contented ourselves with
returning a boolean answer ("did the regex match the target string?"), or
maybe an integer answer ("at what position did match occur?"), it now sounds
more like we want to return a small bundle of information, something of a
I<match object>, perhaps, describing the C<target> string, along with C<from>
and C<to> positions delineating the match. As we shall see in part 4, the match
object will also quite naturally play host to captured submatches.
I hope you enjoyed your anything-but-fish-or-eel. See you in part 3.