\chapter{Case Study: Data Structure Selection}
\label{data_struct_sel}
At this point you have learned about Raku's core data structures,
and you have seen some of the algorithms that use them.
This chapter presents a case study with exercises that let
you think about choosing data structures and practice using them.
But first, I would like to briefly introduce two conditional
structures that have been left aside so far and provide
a couple of new possibilities about subroutine signatures.
\index{signature}
\section{The Ternary Conditional Operator}
\label{ternary operator}
\index{ternary conditional operator}
Consider the following code that tests the value of a positive
integer:
\begin{verbatim}
my $result;
if $num < 10 {
$result = "One digit";
} else {
$result = "More than one digit";
}
say $result;
\end{verbatim}
This is quite simple, but a bit long. This can be rewritten
in just one line of code:
\begin{verbatim}
say $num < 10 ?? "One digit" !! "More than one digit";
\end{verbatim}
The operator is in two parts: the {\tt ??} and the {\tt !!}, which
separate three expressions (hence the name ``ternary operator''):
\index{ternary operator}
\index{operator!ternary}
\begin{itemize}
\item The condition to be evaluated (is \verb'$num' less than 10?);
\item The expression defining the value if the condition is true;
\item The expression defining the value if the condition is false.
\end{itemize}
This statement checks whether \verb'$num' is less than 10 and, if
so, prints ``One digit''; if the
condition is false, it prints ``More than one digit.''
This operator does not provide any new functionality; it just
offers a more concise syntax.
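Because the ternary operator is an expression, it can be used
anywhere a value is expected, for instance on the right-hand side
of an assignment:
\begin{verbatim}
my $num = 7;
my $parity = $num %% 2 ?? "even" !! "odd";  # %% is the divisibility operator
say $parity;   # -> odd
\end{verbatim}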
It is possible to nest several ternary operators to examine
successively multiple choices:
\index{ternary conditional operator, nesting}
\begin{verbatim}
say $value < 10 ?? "One digit" !!
$value < 100 ?? "Two digits" !!
$value < 1000 ?? "Three digits" !!
"More than three digits";
\end{verbatim}
This construct is a form of what is sometimes called a
\emph{switch statement}, because the C language and
many languages derived from it use the {\tt switch} keyword
to describe such a multiple choice conditional.
\index{switch statement}
This is much more concise and often more convenient than nested
{\tt if ... then ... else} conditionals, but the next section
provides a more powerful {\tt switch} type of statement.
\section{The {\tt given ... when} ``Switch'' Statement}
\label{given_when}
\index{given statement}
\index{when statement}
\index{switch statement}
Raku has a ``switch'' statement, written with the
{\tt given} and {\tt when} keywords. The
{\tt given} keyword introduces the variable or expression
that will be tested, and each of the {\tt when}
statements specify a condition followed by a block that
will execute if the condition is true. By default, the
process stops at the first condition that is satisfied.
The example just above can be rewritten as follows:
\begin{verbatim}
given $value {
when 0..9 { say "One digit" }
when $_ < 100 { say "Two digits" }
when /^\d**3$/ { say "Three digits" }
default { say "More than three digits" }
}
\end{verbatim}
The \verb'given $value' statement ``topicalizes'' the argument,
i.e., assigns the content of \verb'$value' to the \verb'$_'
topical variable (or, more precisely, aliases it to \verb'$_').
The argument to {\tt given} is a simple variable in the
example above, but it can be a complex expression whose
evaluation is stored (and cached) into \verb'$_'.
Each of the {\tt when} conditions is checked against \verb'$_'.
I have written these conditions in three different syntactical
forms to illustrate some of the various possibilities:
\index{topical variable}
\index{alias}
\begin{itemize}
\item The first one checks \verb'$_' (implicitly) against
the \verb'0..9' range.
\index{range operator}
\item The second one compares explicitly \verb'$_' to 99.
\item The third one uses a regex to check whether \verb'$_' has
three digits.
\index{regex}
\item The \verb'default' statement runs only if the other
conditions have failed.
\index{default statement}
\end{itemize}
Only one message will be printed, because the
matching process stops as soon as one condition has been
satisfied, and the \verb'default' clause will run if
no other condition has been met.
If there is no specific operator in the {\tt when} clause,
then it will smart match the expression in the {\tt when}
clause against \verb'$_':
\index{smart match}
\begin{verbatim}
when $foo { ... }
# roughly equivalent to: if $_ ~~ $foo { ... }
\end{verbatim}
Note that the {\tt given} keyword is not doing much more than
topicalizing its argument for the rest of the block.
The {\tt when} clauses are doing the bulk of the real work. In fact,
you could even use the {\tt when} clauses without a {\tt given},
provided you assign the right value to \verb'$_', which, as you
hopefully remember, can be done with a {\tt for} block:
\index{for block}
\begin{verbatim}
my $val = 7;
for $val {
when 0..6 { say "less than"}
when 7 {say "Exact";}
when 8..* {say "greater than";}
}
\end{verbatim}
\index{proceed clause}
It is possible to add a {\tt proceed} clause at the end of
any of the conditional code blocks to prevent the process
from stopping after that code block has succeeded. For example,
you might write this:
\begin{verbatim}
given $value {
when 0..9 { say "One digit"}
when 10..99 { say "Two digits"; proceed}
when 42 { say "The response to the ultimate question"}
when /^\d**3$/ { say "Three digits" }
default { say "More than three digits" }
}
\end{verbatim}
Here, if \verb'$value' is 42, two messages will be displayed,
``Two digits'' and ``The response to the ultimate question,''
because the {\tt proceed} clause in the second code block
prevents the process from stopping on the first successful match.
Good, it seems, but there is a problem. The {\tt proceed} clause
should be used with some care, as it can easily lead
to unexpected results. In fact, \emph{the code above
is actually wrong}: if \verb'$value' has two digits but is
not 42 (if it is, say, 43), the default block will also run,
because the only other successful match had this {\tt proceed}
clause, and will say that there are ``More than three digits''
although this is obviously false.
\label{proceed_ex}
As an exercise, test the above code with various values and
try to find a way to fix the bug with the {\tt proceed}
clause.
Solution: \ref{sol_proceed_ex}
\section{Multiple Conditionals with Junctions}
\label{junction}
\index{junction}
Sometimes, you need to test several possible conditions
and would write something like this:
\begin{verbatim}
if $value == 3 or $value == 5 or $value == 7 { #...
\end{verbatim}
This is quite verbose, and I don't like to have to
repeat the \verb'$value ==' part several times. I
would prefer to be able to write something like:
``If value is 3 or 5 or 7.'' That syntax doesn't work,
but junctions enable you to do almost that:
\begin{verbatim}
if $value == 3|5|7 {
say "$value is either 3, or 5, or 7"
}
\end{verbatim}
The \verb'3|5|7' part is a junction.
Junctions are superpositions of unordered values.
Operations on junctions are executed for each item
of the junction separately (and maybe even in parallel),
and the results are assembled in a junction of the same
type.
The junction types differ when evaluated in boolean context.
The types and associated junction constructors are
\verb'any', \verb'all', \verb'one' and \verb'none'.
\index{any junction}
\index{all junction}
\index{none junction}
\index{one junction}
\begin{table}[htb]%
\centering%
\begin{tabular}{llcl}
\toprule%
\textbf{Type} & \textbf{Constructor} & \textbf{Infix Operator} & \textbf{True if...} \\\midrule
any & any & \verb'|' & at least one value evaluates to \verb|True| \\
all & all & \verb'&' & no value evaluates to \verb|False| \\
one & one & \verb'^' & exactly one value evaluates to \verb|True| \\
none & none & & no value evaluates to \verb|True| \\\bottomrule
\end{tabular}
\end{table}
\index{junction types}
For example, \verb'3|5|7' is the same as \verb'any(3, 5, 7)'.
This is another example of an \verb'any' junction:
\begin{verbatim}
my Junction $weekday = any <Monday Tuesday Wednesday
                            Thursday Friday>;
if $day eq $weekday {
say "This is a weekday";
}
\end{verbatim}
In this example the \verb'eq' operator is called with each
pair \verb'$day', 'Monday', \verb'$day', 'Tuesday', etc.
and the result is put into an any-junction again. As
soon as the result is determined (in this case, as soon
as one comparison returns True), it can abort the
execution of the other comparisons.
This works not only for operators, but also for routines:
\begin{verbatim}
if 2 == sqrt(4 | 9 | 16) {
say "YaY";
}
\end{verbatim}
Junctions can be very handy to perform fairly common tasks.
Imagine you have an array of numbers, and you want to
check that all of them are non-negative. In most
programming languages, you would write a loop to
iterate over the array and check each item in turn.
In Raku, this is very simple:
\begin{verbatim}
my @values = get-values();
if all(@values) >= 0 { ... }
\end{verbatim}
Or, if you want to avoid a division by zero exception:
\begin{verbatim}
my @values = get-values();
perform-division(@values) if none(@values) == 0;
\end{verbatim}
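The \verb'one' and \verb'none' constructors follow the same
pattern. For instance, to check that a value matches exactly one
of several candidates:
\begin{verbatim}
my $n = 5;
say "Exactly one match" if $n == one(3, 5, 7);  # -> Exactly one match
\end{verbatim}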
\index{junction}
\section{Subroutine Named and Optional Parameters}
The subroutines that we have seen so far used \emph{positional}
parameters, i.e., parameters whose binding with the subroutine
call arguments rely on their order within the list of
arguments and in the signature. This is usually fine when
the number of arguments passed to the subroutine is small
(say, three or less).
\index{positional parameter}
\index{parameter!positional}
\index{signature}
When the subroutine signature becomes longer, using positional
arguments might become cumbersome and error-prone.
\subsection{Named Parameters}
\label{named_parameters}
\index{named parameter}
\index{parameter!named}
Named arguments may be supplied in any order: the name of
the parameter is bound to the argument having the same
name. For example:
\begin{verbatim}
sub divide (:$dividend, :$divisor where $divisor != 0) {
return $dividend/$divisor;
}
say divide(dividend => 2048, divisor => 128); # -> 16
# or:
say divide(divisor => 128, dividend => 2048); # -> 16
\end{verbatim}
The arguments are supplied at the subroutine call as a list of
pairs using the pair-constructor syntax. In the signature,
the parameters are retrieved with the
so-called colon-pair syntax: the \verb'$dividend' parameter is
bound to the value of the pair whose key is ``dividend'' (2048),
and \verb'$divisor' is similarly bound to 128, irrespective of
the order of the arguments in the subroutine call.
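Note that named parameters are optional by default; appending an
exclamation mark to a named parameter in the signature makes it
mandatory. The arguments may also be passed at the call site with
the colon-pair syntax instead of the \verb'=>' pair constructor:
\begin{verbatim}
sub divide (:$dividend!, :$divisor!) {  # both parameters required
    return $dividend / $divisor;
}
say divide(:dividend(2048), :divisor(128));   # -> 16
\end{verbatim}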
\index{colon-pair syntax}
\index{pair constructor}
These named parameters are especially useful when the number of
arguments is large. For example, we haven't covered
object-oriented programming yet (see Chapter~\ref{objects}),
but this is how we could create an object of the
(user-defined) Rectangle class:
\begin{verbatim}
my $rect = Rectangle.new(
origin_x => 100,
origin_y => 200,
width => 23,
length => 42,
color => 'black'
);
\end{verbatim}
Clearly, using five positional parameters would be impractical.
\subsection{Optional Parameters}
\index{optional parameter}
\index{parameter!optional}
\index{variadic subroutine}
\index{signature}
Sometimes, the actual number of arguments is not known
in advance: for example, a subroutine may be called with
a variable number of arguments. Such a subroutine is said
to be \emph{variadic}. You can define a parameter to be
optional by postfixing it with a question mark in the
subroutine signature:
\begin{verbatim}
sub my-sub($x, $y?) { # simple optional parameter
if $y.defined {
say "The second parameter has been supplied and defined";
} else {
say "The second parameter has not been supplied";
}
# ...
}
\end{verbatim}
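Calling this subroutine with one or two arguments illustrates
both branches:
\begin{verbatim}
my-sub(1);      # -> The second parameter has not been supplied
my-sub(1, 2);   # -> The second parameter has been supplied and defined
\end{verbatim}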
\index{positional parameter}
\index{parameter!positional}
When using positional parameters, the optional parameters
always have to be the last ones in the list (after the
mandatory ones).
A parameter can also be made optional by supplying a
{\bf default value}:
\index{default value}
\index{default parameter}
\index{parameter!default value}
\begin{verbatim}
sub my-log($number, $base = e) { # e is a predefined constant
# $base is an optional parameter
return log($number) / log($base);
}
say my-log(4); # Natural log (base e) -> 1.38629436111989
say my-log(32, 2); # Log base 2 -> 5
say my-log(100, 10); # Common log (base 10) -> 2
\end{verbatim}
Here, if the second argument is not supplied, the default
value (\emph{e}) is used instead. Conversely, if there
is a second argument, it {\bf overrides} the default value.
\index{slurpy parameters}
\index{parameter!slurpy}
\label{slurpy_parameters}
Sometimes, having optional or default parameters is not good
enough. For example, the subroutine may have to process a list
containing any number of values. For situations like this,
you can use a \emph{slurpy parameter}, i.e., a kind of array
placed at the end of the parameter list that will slurp up
all the remaining arguments. This kind of slurpy parameter
is declared with a star prepended to the array sigil
(``\verb"*@"''). In the following example,
the subroutine takes one mandatory parameter (the first
number of the list) and a list of additional arguments
that will be stored in the \verb'@rest' array:
\index{sigil}
\begin{verbatim}
sub my-sum($first-num, *@rest) {
say @rest; # -> [3 4 5 12 17]
return $first-num + [+] @rest;
}
say my-sum 1, 3, 4, 5, 12, 17; # -> 42
\end{verbatim}
Some further examples of slurpy parameters have been
provided in Section~\ref{sol_exercise_queue}.
\section{Word Frequency Analysis}
\label{analysis}
Now, let's get to the case study.
As usual, you should at least attempt the exercises
before you read the suggested solutions, which are
provided in the following sections of this chapter.
\begin{exercise}
Write a program that reads a file, breaks each line into
words, strips whitespace and punctuation from the words, and
converts them to lowercase.
\end{exercise}
\begin{exercise}
\index{Project Gutenberg}
Go to Project Gutenberg (\url{http://gutenberg.org}) and download
your favorite out-of-copyright book in plain text format.
\index{plain text}
\index{text!plain}
Modify your program from the previous exercise to read the book
you downloaded, skip over the header information at the beginning
of the file, and process the rest of the words as before.
Then modify the program to count the total number of words in
the book, and the number of times each word is used.
\index{word frequency}
\index{frequency!word}
Print the number of different words used in the book. Compare
different books by different authors, written in different eras.
Which author uses the most extensive vocabulary?
\end{exercise}
\begin{exercise}
Modify the program from the previous exercise to print the
20 most frequently used words in a given book.
\end{exercise}
\begin{exercise}
Modify the previous program to read a word list (see
Section~\ref{wordlist}) and then print all the words in the book that
are not in the word list. How many of them are typos? How many of
them are common words that {\em should} be in the word list, and how
many of them are really obscure?
\end{exercise}
\section{Random Numbers}
\index{random number}
\index{number, random}
\index{deterministic}
\index{pseudorandom}
Given the same inputs, most computer programs generate the same
outputs every time, so they are said to be {\bf deterministic}.
Determinism is usually a good thing, since we expect the same
calculation to yield the same result. For some applications, though,
we want the computer to be unpredictable. Games are an obvious
example, but there are more.
Making a program truly nondeterministic turns out to be difficult,
but there are ways to make it at least seem nondeterministic. One of
them is to use algorithms that generate {\bf pseudorandom} numbers.
Pseudorandom numbers are not truly random because they are generated
by a deterministic computation, but just by looking at the numbers it
is all but impossible to distinguish them from truly random ones.
Raku provides functions such as {\tt rand} that generate
pseudorandom numbers (which we will simply call ``random''
numbers from here on).
\index{rand function}
\index{function!rand}
The function {\tt rand} returns a random number (of {\tt Num}
type) between 0.0 and 1.0 (including 0.0 but not 1.0). Each time you
call {\tt rand}, you get the next number in a long series. To see a
sample, run this loop in the REPL:
\begin{verbatim}
say rand for 1..5;
\end{verbatim}
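Since the series is produced by a deterministic computation, you
can make it reproducible by setting the seed with the {\tt srand}
function, which is handy for writing repeatable tests:
\begin{verbatim}
srand(42);
my $x = rand;
srand(42);       # resetting the seed replays the same series
say $x == rand;  # -> True
\end{verbatim}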
Used as a method, {\tt rand} returns a random number between
0.0 and the value of the invocant. For example, {\tt 10.rand}
returns a random number between 0 and 10 (10 not included). You
might try it as a one-liner:
\index{one-liner mode}
\begin{verbatim}
$ raku -e 'say 10.rand for 1..5'
8.23987158729588
9.83276889381497
2.52313276833335
3.44713459548771
1.82329894347025
\end{verbatim}
You should hopefully get a different output than I did. If
you want to run such a one-liner under Windows, remember
that you'll need to replace single quotes with double quotes.
To obtain random integers between 1 and 10, you may use
the {\tt Int} and {\tt rand} methods:
\index{Int function or method}
\begin{verbatim}
$ raku -e 'say 10.rand.Int + 1 for 1..5'
5
10
1
6
3
\end{verbatim}
The {\tt pick} function or method takes a number \verb'$count' and a list as arguments and returns \verb'$count' items chosen at
random and without repetition. For example:
\index{pick function or method}
\begin{verbatim}
> say <1 3 4 5 7 9 10 12 14 42>.pick(5);
(5 42 3 4 7)
> say pick 5, <1 3 4 5 7 9 10 12 14 42>;
(42 12 5 1 9)
\end{verbatim}
If \verb'$count' is greater than or equal to the number of
items of the list, then all elements from the list are returned
in a random sequence.
To obtain random unique integers in a range, you might use
{\tt pick} on a range:
\begin{verbatim}
> say pick 5, 1..20;
(5 3 6 18 7)
> say (1..20).pick(5);
(20 4 18 2 7)
\end{verbatim}
If you don't specify the number of random numbers, you'll get one
random pick:
\begin{verbatim}
> say (1..20).pick;
19
\end{verbatim}
%
\begin{exercise}
\index{histogram!random choice}
Write a function named \verb"choose_from_hist" that takes
a histogram as defined in Section~\ref{histogram} and returns a
random value from the histogram, chosen with probability
in proportion to frequency. For example, for the three
items: \verb"('a', 'a', 'b')", your function should
return \verb"'a'" with probability $2/3$ and \verb"'b'"
with probability $1/3$.
\end{exercise}
\section{Word Histogram}
You should attempt the previous exercises before you go on.
For the purpose of presenting the solutions to the above
exercises, I've used the plain text of {\em Emma} (1816), the
novel by Jane Austen, downloaded from the Gutenberg project
(\url{http://www.gutenberg.org/files/158/158-0.txt}) and
saved in a file called \emph{emma.txt}\footnote{The Gutenberg
project slightly modified the format of this file after this
chapter was written. To avoid any problem associated with this
change (and any other future changes), you can download the
original file from my GitHub repository:
\url{https://github.com/LaurentRosenfeld/think_raku/blob/master/Supplementary/emma.txt}.} Use the same
text if you want to compare your solutions and results
with mine.
\index{Austen, Jane}
\index{Emma}
\index{Project Gutenberg}
Here is a program that reads the \emph{emma.txt} file and
builds a histogram of the words in the file:
\index{histogram!word frequencies}
\begin{verbatim}
my %histogram;
my $skip = True; # flag to skip the header
sub process-line(Str $line is copy) {
if defined index $line, "*END*THE SMALL PRINT!" {
$skip = False ;
return;
}
return if $skip;
$line ~~ s:g/<[-']>/ /; # Replacing dashes and apostrophes with spaces
$line ~~ s:g/<[;:,!?.()"_`]>//; # removing punctuation symbols
$line = $line.lc; # setting string to lower case
for $line.words -> $word {
%histogram{$word}++;
}
$skip = True if $line ~~ /^finis$/;
}
process-line $_ for "emma.txt".IO.lines;
\end{verbatim}
\index{case!lower}
%
The program reads each line of the \emph{emma.txt} file and, for
each line, calls \verb"process-line".
\index{accumulator!histogram}
The \verb"process-line" subroutine skips the header lines
(i.e., all the lines until a line containing the string
``*END*THE SMALL PRINT!''
is met \footnote{This line is no longer present in the new version of
the Emma file currently on the Gutenberg project site. So this will
work only with my version of this file on my GitHub repository:
\url{https://github.com/LaurentRosenfeld/think_raku/blob/master/Supplementary/emma.txt}.}). It replaces dashes and apostrophes with spaces, removes
various punctuation characters, and sets the line to lower case.
Finally, it splits the line into individual words that are
stored and counted with an accumulator in the \verb'%histogram'
hash.
To know whether the program is doing something like what
it is supposed to do, we can display a few entries of the
\verb'%histogram' hash:
\begin{verbatim}
# displaying 20 entries of the histogram
my $count = 0;
for %histogram -> $pair {
    say sprintf "%-24s %d", $pair.key, $pair.value;
    last if ++$count >= 20;
}
\end{verbatim}
This prints out the following output:
\begin{verbatim}
embarrassing             1
hows                     1
appealed                 2
bestow                   2
articulate               1
demands                  2
closely                  1
dull                     9
hearts                   1
column                   1
possesses                1
attributed               1
jumped                   2
forwards                 2
wittier                  2
expert                   2
attractive               2
asserted                 2
oftentimes               1
fancy                    38
\end{verbatim}
To count the total number of words in the file, we can add up
the values in the histogram:
\begin{verbatim}
my $word_count = [+] %histogram.values;
say "There are $word_count words in the book.";
\end{verbatim}
%
The number of different words is just the number of items in
the hash:
\begin{verbatim}
my $distinct-words = %histogram.elems;
say "There are $distinct-words distinct words in the book.";
\end{verbatim}
%
Note that you could reduce the above to one code line by
interpolating a code block within the output string:
\index{interpolating a code block in a string}
\begin{verbatim}
say "There are {%histogram.elems} distinct words in the book."
\end{verbatim}
%
And the results:
\begin{verbatim}
There are 161991 words in the book.
There are 7110 distinct words in the book.
\end{verbatim}
%
\section{Most Common Words}
\label{most_common_words}
To find the most common words in \verb'emma.txt', we can
sort the \verb'%histogram' hash according to the values
(word frequencies) and retrieve the 10~most frequent words
into an array.
\index{sort}
\begin{verbatim}
my @most_commons = (sort { %histogram{$^b} cmp %histogram{$^a} },
%histogram.keys)[0..9];
say $_ for map { "$_ \t%histogram{$_} "}, @most_commons;
\end{verbatim}
The {\tt sort} function receives the keys of the histogram and
its comparison function compares the values associated with
those keys. Since we use the key \verb'$^b' before the key
\verb'$^a', the sort will produce a reverse (descending) sort
order. The whole {\tt sort} procedure is placed within parentheses,
so that the subscript range {\tt [0..9]} acts as a slice on
the list produced by {\tt sort} and retains only the first 10~most
frequent words. The result is stored into the
\verb'@most_commons' array. The next code line just reprocesses
the array to display the words and their respective frequency.
\index{slice}
This displays the following output:
\begin{verbatim}
to 5241
the 5205
and 4897
of 4295
i 3192
a 3130
it 2529
her 2490
was 2400
she 2364
\end{verbatim}
%
If you want to see more interesting words, you might, as a
further exercise, filter the histogram and retain only
words that have more than, say, four letters.
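One possible way to perform that filtering (shown here only as a
sketch, assuming the \verb'%histogram' hash built earlier) is to
grep the hash on the length of the keys:
\begin{verbatim}
# keep only the entries whose key is longer than four characters
my %long-words = %histogram.grep({ .key.chars > 4 });
\end{verbatim}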
The \verb'@most_commons' temporary array is not really needed.
We could do the whole thing in a single statement:
\begin{verbatim}
say $_ for map { "$_ \t%histogram{$_} "},
(sort { %histogram{$^b} cmp %histogram{$^a} },
%histogram.keys)[0..9];
\end{verbatim}
This is an example of data pipeline (or stream) programming. Such a statement
needs to be read from bottom to top and from right to left. The
first step is \verb'%histogram.keys', which produces a list
of the histogram keys; this list is fed into the sort statement
to produce a list of the keys sorted (in descending order) according
to their values; once this whole part between parentheses is
completed, the subscript range \verb'[0..9]' retains the
10~most frequent words and feeds them into the map statement,
which produces the list of words and frequencies for the final
output.
Let me add one word of caution here: sorting the histogram by
values and picking up the top 10 to get the most frequent
words is probably the easiest way to solve the problem and the
shortest code to do it. That's the reason I have used this
solution here. But it is not the most efficient solution
from the standpoint of the run time, because it involves the
cost of sorting a relatively large data set, whereas we are
using only a small part of the sorted data. There are some
better algorithms to do that from the standpoint of runtime
efficiency, but they are more complicated. So, there is a
tradeoff here between coding efficiency and performance.
Assuming this is code that we want to run only once, I have
chosen to favor coding efficiency.
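For comparison, the same sort-based approach can also be written
as a method chain that reads from left to right, which some find
easier to follow than the bottom-to-top pipeline (the cost of the
full sort is the same):
\begin{verbatim}
# sort the pairs by descending value, keep the first ten
for %histogram.sort({ $^b.value <=> $^a.value }).head(10) -> $pair {
    say "{$pair.key} \t{$pair.value}";
}
\end{verbatim}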
\index{data pipeline programming}
\section{Optional Parameters}
\index{optional parameter}
\index{parameter!optional}
We saw earlier in this chapter that subroutines can
take optional parameters. We can use this functionality to
write a subroutine that prints the most common words in the
\verb'%histogram' hash extracted from \emph{emma.txt}.
\begin{verbatim}
display-most-common(%histogram, 5);
sub display-most-common (%hist, Int $num = 10) {
say $_ for map { "$_ \t%hist{$_} "},
(sort { %hist{$^b} cmp %hist{$^a} },
%hist.keys)[0..$num - 1];
}
\end{verbatim}
This will display the top five words of the list above
(Section~\ref{most_common_words}). If we call it without
the second parameter:
\begin{verbatim}
display-most-common(%histogram);
\end{verbatim}
we will get the 10~most common words (same list as the one
shown in Section~\ref{most_common_words} above), because
the default value for the second parameter is 10.
\section{Hash Subtraction}
\label{hashsub}
\index{hash!subtraction}
\index{subtraction!hash}
Finding the words from \emph{emma.txt} that are not in the word list
{\tt words.txt} is a problem you might recognize as set
subtraction; that is, we want to find all the words from one
set (the words in \emph{emma.txt}) that are not in the other (the
words in \emph{words.txt}).
{\tt subtract} takes hashes \verb'%main' and \verb'%dict' and
returns a new hash that contains all the keys from \verb'%main'
that are not in \verb'%dict'. Since we don't really care about
the values, we set them all to 1:
\begin{verbatim}
sub subtract (%main, %dict) {
my %difference;
for %main.keys -> $word {
%difference{$word} = 1 unless %dict{$word}:exists;
}
return %difference;
}
\end{verbatim}
%
To find the words in \emph{emma.txt} that are not in \emph{words.txt},
we need to load the word list into a hash and to call
{\tt subtract}, passing the two hashes as arguments. We also
add some code to print the first 20~words not found in the
word list:
\begin{verbatim}
my %word-list = map { $_ => 1 }, "words.txt".IO.lines;
my %unknown-words = subtract(%histogram, %word-list);
say %unknown-words.keys.head(20);
\end{verbatim}
%
\index{head}
Notice that rather than using a subscript slice, I have used
here the {\tt head} method to print out the first 20~words of
the list. This is just another way to get the first ``n'' items
of a list or array. There is also a {\tt tail} method to
retrieve the last ``n'' items of a list or an array.
\index{head method}
\index{tail method}
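For example:
\begin{verbatim}
say <a b c d e>.head(2);   # -> (a b)
say <a b c d e>.tail(2);   # -> (d e)
\end{verbatim}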
Here are some of the results from {\em Emma}:
\begin{verbatim}
(penetrated unsullied bateses outstepped particularity experienced
italian sunday solicitously blockhead unbleached ult 26th
christian 7th untouched iii greensward houseroom tete)
\end{verbatim}
%
Some of these words are names and ordinal numbers. Others
are rare or no longer in common use. But a few are common
words that should really be in the list!
Note that I have used a hash (\verb'%unknown-words') here to
store the words of \emph{emma.txt} not found in the word list.
Since we are using the data only to print a sample of 20 words,
we could have used an array as well.
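Incidentally, since the problem is really set subtraction, we
could also have used Raku's built-in set difference operator,
\verb'(-)', directly on the keys; it returns a \verb'Set' rather
than a hash (this is an alternative to the {\tt subtract}
subroutine above, assuming the same \verb'%histogram' and
\verb'%word-list' hashes):
\begin{verbatim}
my $unknown = %histogram.keys (-) %word-list.keys;
say $unknown.keys.head(20);
\end{verbatim}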
\section{Constructing New Operators}
\label{operator_construction}
\index{hash!subtraction}
\index{creating new operators}
\index{new operators!creating}
\index{overload operators}
\index{operator!overloading}
\index{operator construction}
Learning about hash subtraction is an excellent
opportunity to introduce a very interesting functionality of
Raku: the capacity to construct new operators or to
redefine existing ones.
Since we are subtracting two lists of words, it is tempting to
use the minus sign to denote this subtraction operation. It
is very easy to create such an operator in Raku:
\begin{verbatim}
multi sub infix:<-> (%main, %dict) {
my %difference;
for %main.keys -> $word {
%difference{$word} = 1 unless %dict{$word}:exists;
}
return %difference;
}
\end{verbatim}
\index{infix}
\index{subroutine signature}
\index{signature}
\index{multi!subroutine}
Compared to the definition of the {\tt subtract} subroutine, the
only differences are in the header line. We will cover the
details in a later chapter, but let us briefly explain here.
Most Raku operators are defined as ``multi'' subroutines,
i.e., subroutines that can be defined several times with
different signatures and bodies; Raku will know which one to use depending
on the signature. Here we define the minus operator as a
multi subroutine whose parameters are two hashes; this will be
enough for the Raku compiler to know that we don't want
the regular subtraction between numerical values, but something
else that applies to two hashes. The minus operator is defined as
``infix,'' which means that the operator will be placed
between its two operands.
Calling this new operator is now just as easy as calling
the regular subtraction operator between numbers; we just need
to use two hashes as operands:
\begin{verbatim}
my %unknown-words = %histogram - %word-list;
\end{verbatim}
The rest of the program works just as before.
\index{extending the language}
This ease of creating new operators is one of the facilities
offered by Raku to \emph{extend the language} from within itself.
We'll come back to that and other ways to extend the language
in later chapters.
\label{fact_operator}
\index{factorial}
\index{factorial!operator}
As an exercise, write a multi subroutine that creates the new
postfix ``!'' operator for computing the factorial of an integer:
\index{factorial}
\begin{verbatim}
say 5!; # -> 120
\end{verbatim}
%
Also try to think about how you would test this new ``!'' operator
against various input values.