=encoding utf8
=head1 TITLE
Synopsis 2: Bits and Pieces
=head1 VERSION
Created: 10 Aug 2004
Last Modified: 16 Oct 2015
Version: 296
This document summarizes Apocalypse 2, which covers small-scale lexical
items and typological issues. (These Synopses also contain updates to
reflect the evolving design of Perl 6 over time, unlike the Apocalypses,
which are frozen in time as "historical documents". These updates are not
marked--if a Synopsis disagrees with its Apocalypse, assume the Synopsis is
correct.)
=head1 One-pass parsing
To the extent allowed by sublanguages' parsers, Perl is parsed using a
one-pass, predictive parser. That is, lookahead of more than one "longest
token" is discouraged. The currently known exceptions to this are where the
parser must:
=over 4
=item *
Locate the end of interpolated expressions that begin with a sigil and might
or might not end with brackets.
=item *
Recognize that a reduce operator is not really beginning a C<[...]>
composer.
=back
One-pass parsing is fundamental to knowing exactly which language you are
dealing with at any moment, which in turn is fundamental to allowing
unambiguous language mutation in any desired direction. (Generic languages
are allowed, but only if intended; accidentally generic languages lead to
loss of linguistic identity and integrity. This is the hard lesson of
Perl 5's source filters and other multi-pass parsing mistakes.)
=head1 Lexical Conventions
=head2 Unicode Semantics
In the abstract, Perl is written in Unicode, and has consistent Unicode
semantics regardless of the underlying text representations. By default
Perl presents Unicode in "NFG" formation, where each grapheme counts as one
character. A grapheme is what the novice user would think of as a character
in their normal everyday life, including any diacritics.
Perl can count Unicode line and paragraph separators as line markers, but
that behavior had better be configurable so that Perl's idea of line numbers
matches what your editor thinks about Unicode lines.
Unicode horizontal whitespace is counted as whitespace, but it's better not
to use thin spaces where they will make adjoining tokens look like a single
token. On the other hand, Perl doesn't use indentation as syntax, so you
are free to use any amount of whitespace anywhere that whitespace makes
sense. Comments always count as whitespace.
=head2 Bracketing Characters
For some syntactic purposes, Perl distinguishes bracketing characters from
non-bracketing. Bracketing characters are defined as any Unicode characters
with either bidirectional mirrorings or Ps/Pe/Pi/Pf properties.
In practice, you're safest using matching characters with Ps/Pe/Pi/Pf
properties. ASCII angle brackets are a notable exception, since they're
bidirectional but not in the Ps/Pe/Pi/Pf sets.
Characters with no corresponding closing character do not qualify as opening
brackets. This includes the second section of the Unicode BidiMirroring
data table.
If a character is already used in Ps/Pe/Pi/Pf mappings, then any entry in
BidiMirroring is ignored (both forward and backward mappings). For any
given Ps character, the next Pe codepoint (in numerical order) is assumed to
be its matching character even if that is not what you might guess using
left-right symmetry. Therefore C<U+298D> (C<⦍>) maps to C<U+298E> (C<⦎>), not C<U+2990> (C<⦐>),
and C<U+298F> (C<⦏>) maps to C<U+2990> (C<⦐>), not C<U+298E> (C<⦎>). Neither C<U+298E> (C<⦎>) nor
C<U+2990> (C<⦐>) is a valid bracket opener, despite having reverse mappings in the
BidiMirroring table.
The C<U+301D> (C<〝>) codepoint has two closing alternatives, C<U+301E> (C<〞>) and
C<U+301F> (C<〟>); Perl 6 recognizes only the one with the lower code point number,
C<U+301E> (C<〞>), as the closing bracket. This policy also applies to new
one-to-many mappings introduced in the future.
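The "next Pe codepoint" rule and the lowest-codepoint tie-break can be sketched in a few lines. This is only an illustration: the routine name C<closer-for> and the 16-codepoint scan limit are assumptions made for the example, the Pi/Pf and BidiMirroring cases are ignored entirely, and nothing here is meant to describe how an implementation actually builds its bracket table.

```raku
# Hypothetical helper: find the closer for a Ps opener by taking the
# next Pe codepoint in numerical order.
sub closer-for(Str $opener) {
    my $cp = $opener.ord;
    # Only characters with general category Ps count as openers here.
    return Nil unless $cp.chr.uniprop eq 'Ps';
    # The scan limit of 16 codepoints is an arbitrary choice for this sketch.
    for ($cp + 1) .. ($cp + 16) -> $candidate {
        return $candidate.chr if $candidate.chr.uniprop eq 'Pe';
    }
    Nil
}

say closer-for("\x[298D]").ord.base(16);   # 298E
say closer-for("\x[298F]").ord.base(16);   # 2990
say closer-for("\x[301D]").ord.base(16);   # 301E
say closer-for("\x[298E]").defined;        # False (a Pe character is not an opener)
```

Because the scan stops at the first Pe codepoint it meets, the C<U+301D> case picks C<U+301E> over C<U+301F> without any special handling.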
However, many-to-one mappings are fine; multiple opening characters may map
to the same closing character. For instance, C<U+2018> (C<‘>), C<U+201A> (C<‚>), and
C<U+201B> (C<‛>) may all be used as the opener for the C<U+2019> (C<’>) closer.
Constructs that count openers and closers assume that only the given opener
is special. That is, if you open with one of the alternatives, all other
alternatives are treated as non-bracketing characters within that construct.
=head2 Multiline Comments
Pod sections may be used reliably as multiline comments in Perl 6. Unlike
in Perl 5, Pod syntax now lets you use C<=begin comment> and C<=end comment>
to delimit a Pod block correctly without the need for C<=cut>. (In fact,
C<=cut> is now gone.) The format name does not have to be C<comment> -- any
unrecognized format name will do to make it a comment. (However, bare
C<=begin> and C<=end> probably aren't good enough, because all comments in
them will show up in the formatted output.)
We have single paragraph comments with C<=for comment> as well. That lets
C<=for> keep its meaning as the equivalent of a C<=begin> and C<=end>
combined. As with C<=begin> and C<=end>, a comment started in code reverts
to code afterwards.
Since there is a newline before the first C<=>, the Pod form of comment
counts as whitespace equivalent to a newline. See S26 for more on embedded
documentation.
=head2 Single-line Comments
Except within a quote literal, a C<#> character always introduces a comment
in Perl 6. There are two forms of comment based on C<#>. Embedded comments
require the C<#> to be followed by a backtick (C<`>) plus one or more
opening bracketing characters.
All other uses of C<#> are interpreted as single-line comments that work
just as in Perl 5, starting with a C<#> character and ending at the
subsequent newline. They count as whitespace equivalent to newline for
purposes of separation. Unlike in Perl 5, C<#> may I<not> be used as the
delimiter in quoting constructs.
=head2 Embedded Comments
Embedded comments are supported as a variant on quoting syntax, introduced
by C<#`> plus any user-selected bracket characters (as defined in
L</Bracketing Characters> above):
say #`( embedded comment ) "hello, world!";
$object\#`{ embedded comments }.say;
$object\ #`「
embedded comments
」.say;
Brackets may be nested, following the same policy as ordinary quote
brackets.
There must be no space between the C<#`> and the opening bracket character.
(There may be the I<visual appearance> of space for some double-wide
characters, however, such as the corner quotes above.)
For multiline comments it is recommended (but not required) to use two or
more brackets both for visual clarity and to avoid relying too much on
internal bracket counting heuristics when commenting code that may
accidentally miscount single brackets:
#`{{
say "here is an unmatched } character";
}}
However, it's sometimes better to use Pod comments because they are
implicitly line-oriented.
=head2 User-selected Brackets
For all quoting constructs that use user-selected brackets, you can open
with multiple identical bracket characters, which must be closed by the same
number of closing brackets. Counting of nested brackets applies only to
pairs of brackets of the same length as the opening brackets:
say #`{{
This comment contains unmatched } and { { { { (ignored)
Plus a nested {{ ... }} pair (counted)
}} q<< <<woot>> >> # says " <<woot>> "
Note however that bare circumfix or postcircumfix C<<< <<...>> >>> is not a
user-selected bracket, but the ASCII variant of the C<< «...» >>
interpolating word list. Only C<#`> and the C<q>-style quoters (including
C<m>, C<s>, C<tr>, and C<rx>) enable subsequent user-selected brackets.
=head2 Unspaces
Some languages such as C allow you to escape newline characters to combine
lines. Other languages (such as regexes) allow you to backslash a space
character for various reasons. Perl 6 generalizes this notion to any kind
of whitespace. Any contiguous whitespace (including comments) may be hidden
from the parser by prefixing it with C<\>. This is known as the "unspace".
An unspace can suppress any of several whitespace dependencies in Perl. For
example, since Perl requires an absence of whitespace between a noun and a
postfix operator, using unspace lets you line up postfix operators:
%hash\ {$key}
@array\ [$ix]
$subref\($arg)
As a special case to support the use above, a backslash where a postfix is
expected is considered a degenerate form of unspace. Note that whitespace
is not allowed before that, hence
$subref \($arg)
is a syntax error (two terms in a row). And
foo \($arg)
will be parsed as a list operator with a C<Capture> argument:
foo(\($arg))
However, other forms of unspace may usefully be preceded by whitespace.
(Unary uses of backslash may therefore never be followed by whitespace or
they would be taken as an unspace.)
Other postfix operators may also make use of unspace:
$number\ ++;
$number\ --;
1+3\ i;
$object\ .say();
$object\#`{ your ad here }.say
Another normal use of a you-don't-see-this-space is typically to put a
dotted postfix on the next line:
$object\ # comment
.say
$object\#`[ comment
].say
$object\
.say
But unspace is mainly about language extensibility: it lets you continue the
line in any situation where a newline might confuse the parser, regardless
of your currently installed parser. (Unless, of course, you override the
unspace rule itself...)
Although we say that the unspace hides the whitespace from the parser, it
does not hide whitespace from the lexer. As a result, unspace is not
allowed within a token. Additionally, line numbers are still counted if the
unspace contains one or more newlines. Since Pod chunks count as whitespace
to the language, they are also swallowed up by unspace. Heredoc boundaries
are suppressed, however, so you can split excessively long lines introducing
heredocs like this:
ok(q:to'CODE', q:to'OUTPUT', \
"Here is a long description", \ # --more--
todo(:parrøt<0.42>, :dötnet<1.2>));
...
CODE
...
OUTPUT
To the heredoc parser that just looks like:
ok(q:to'CODE', q:to'OUTPUT', "Here is a long description", todo(:parrøt<0.42>, :dötnet<1.2>));
...
CODE
...
OUTPUT
Note that this is one of those cases in which it is fine to have whitespace
before the unspace, since we're only trying to suppress the newline
transition, not all whitespace as in the case of postfix parsing. (Note
also that the example above is not meant to spec how the test suite works.)
=head2 Comments in Unspaces and vice versa
An unspace may contain a comment, but a comment may not contain an unspace.
In particular, end-of-line comments do not treat backslash as significant.
If you say:
#`\ (...
or
#\ `(...
it is an end-of-line comment, not an embedded comment. Write:
\ #`(
...
)
to mean the other thing.
=head2 Unspace disallowed within regexes
Within a regex, unspace is disallowed as too ambiguous with customary
backslashing conventions in surrounding cultures. Hence you must write an
explicit whitespace match some other way, such as with quotes or with a
C<\x20> or C<\c32> escape. On the other hand, while an unspace can start
with C<\#> in normal code, C<\#> within a regex is specifically allowed, and
is not taken as unspace, but matches a literal C<U+0023> (NUMBER SIGN). (Within
a character class, you may also escape whitespace with a backslash; the
restriction on unspace applies only at the normal pattern-matching level.)
=head2 Optional Whitespace and Exclusions
In general, whitespace is optional in Perl 6 except where it is needed to
separate constructs that would be misconstrued as a single token or other
syntactic unit. (In other words, Perl 6 follows the standard
I<longest-token> principle, or in the cases of large constructs, a I<prefer
shifting to reducing> principle. See L</Grammatical Categories> below for
more on how a Perl program is analyzed into tokens.)
This is an unchanging deep rule, but the surface ramifications of it change
as various operators and macros are added to or removed from the language,
which we expect to happen because Perl 6 is designed to be a mutable
language. In particular, there is a natural conflict between postfix
operators and infix operators, either of which may occur after a term. If a
given token may be interpreted as either a postfix operator or an infix
operator, the infix operator requires space before it. Postfix operators
may never have intervening space, though they may have an intervening dot.
If further separation is desired, an unspace or embedded comment may be used
as described above, as long as no whitespace occurs outside the unspace or
embedded comment.
For instance, if you were to add your own C<< infix:<++> >> operator, then
it must have space before it. The normal autoincrementing C<< postfix:<++>
>> operator may never have space before it, but may be written in any of
these forms:
$x++
$x\++
$x.++
$x\ ++
$x\ .++
$x\#`( comment ).++
$x\#`((( comment ))).++
$x\
.++
$x\ # comment
# inside unspace
.++
$x\ # comment
# inside unspace
++ # (but without the optional postfix dot)
$x\#`『 comment
more comment
』.++
$x\#`[ comment 1
comment 2
=begin Podstuff
whatever (Pod comments ignore current parser state)
=end Podstuff
comment 3
].++
=head3 Implicit Topical Method Calls
A consequence of the postfix rule is that (except when delimiting a quote or
terminating an unspace) a dot with whitespace in front of it is always
considered a method call on C<$_> where a term is expected. If a term is
not expected at this point, it is a syntax error. (Unless, of course, there
is an infix operator of that name beginning with dot. You could, for
instance, define a Fortranly C<< infix:<.EQ.> >> if the fit took you. But
you'll have to be sure to always put whitespace in front of it, or it would
be interpreted as a postfix method call instead.)
For example,
foo .method
and
foo
.method
will always be interpreted as
foo $_.method
but never as
foo.method
Use some variant of
foo\
.method
if you mean the postfix method call.
One consequence of all this is that you may no longer write a Num as C<42.>
with just a trailing dot. You must instead say either C<42> or C<42.0>. In
other words, a dot following a number can only be a decimal point if the
following character is a digit. Otherwise the postfix dot will be taken to
be the start of some kind of method call syntax. (The C<.123> form with a
leading dot is still allowed however when a term is expected, and is
equivalent to C<0.123> rather than C<$_.123>.)
=head2 Keywords and whitespace
One other spot where whitespace makes a difference is after various
keywords, such as control flow or other statement-oriented keywords. Such
keywords require whitespace after them. (Again, this is in the interests of
extensibility.) So for instance, if you define a symbol that happens to be
the same as the keyword C<if>, you can still use it as a non-keyword, as
long as you don't put whitespace after it:
my \if = 42; say (if) if if; # prints 42
Here only the middle C<if> of the second statement is taken as a keyword
because it has whitespace after it. The other mentions of C<if> do not, and
would be illegal were it not that the symbol is defined in this scope. If
you omit the definition, you'd get a message like this:
Whitespace required after keyword 'if'
at myfile:1
------> say (if⏏) if if;
Undeclared routine:
if used at line 1
=head1 Built-In Data Types
Perl 6 has an optional type system that helps you write safer code that
performs better. The compiler is free to infer what type information it can
from the types you supply, but it will not complain about missing type
information unless you ask it to.
Perl 6 is an OO engine, but you're not generally required to think in OO
when that's inconvenient. However, some built-in concepts such as
filehandles are more object-oriented in a user-visible way than in Perl 5.
=head2 The P6opaque Datatype
In support of OO encapsulation, there is a new fundamental data
representation: B<P6opaque>. External access to opaque objects is always
through method calls, even for attributes.
=head2 Name Equivalence of Types
Types are officially compared using name equivalence rather than structural
equivalence. However, we're rather liberal in what we consider a name. For
example, the name includes the version and authority associated with the
module defining the type (even if the type itself is "anonymous"). Beyond
that, when you instantiate a parametric type, the arguments are considered
part of the "long name" of the resulting type, so one C<Array of Int> is
equivalent to another C<Array of Int>. (Another way to look at it is that
the type instantiation "factory" is memoized.) Typename aliases are
considered equivalent to the original type. In particular, the C<Array of
Int> syntax is just sugar for C<Array:of(Int)>, which is the canonical form
of an instantiated generic type.
This name equivalence of parametric types extends only to parameters that
can be considered immutable (or that at least can have an immutable snapshot
taken of them). Two distinct classes are never considered equivalent even
if they have the same attributes because classes are not considered
immutable.
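A short sketch of these rules, assuming an implementation that spells a parametric instantiation as C<Array[Int]> in term position (the C<Array of Int> and C<Array:of(Int)> spellings above may not be accepted there). The class names and the alias C<IntArray> are invented for the example.

```raku
# Two instantiations with the same argument name the same type.
say Array[Int] === Array[Int];     # True

# A typename alias is equivalent to the original type.
constant IntArray = Array[Int];
say IntArray === Array[Int];       # True

# Distinct classes are never equivalent, even with identical attributes.
class Celsius    { has $.degrees }
class Fahrenheit { has $.degrees }
say Celsius === Fahrenheit;        # False
```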
=head2 Properties on Objects
Perl 6 supports the notion of B<properties> on various kinds of objects.
Properties are like object attributes, except that they're managed by the
individual object rather than by the object's class.
According to S12, properties are actually implemented by a kind of mixin
mechanism, and such mixins are accomplished by the generation of an
individual anonymous class for the object (unless an identical anonymous
class already exists and can safely be shared).
=head3 Traits
Properties applied to objects constructed at compile-time, such as variables
and classes, are also called B<traits>. Traits cannot be changed at
run-time. Changes to run-time properties are done via mixin instead, so
that the compiler can optimize based on declared traits.
=head2 Types as Constraints
A variable's type is a constraint indicating what sorts of values the
variable may contain. More precisely, it's a promise that the object or
objects contained in the variable are capable of responding to the methods
of the indicated "role". See S12 for more about roles.
# $x can contain only Int objects
my Int $x;
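As a hedged illustration of what the constraint means in practice, the sketch below assigns a conforming and a non-conforming value. The variable names are invented, and the exception type shown is the one a Rakudo-like implementation is assumed to throw; the text above does not mandate it.

```raku
my Int $x = 42;            # an Int satisfies the constraint
my $text = "forty-two";

# Assigning a Str violates the constraint, so the assignment throws.
try { $x = $text };
my $error = $!;
say $error.^name;          # X::TypeCheck::Assignment
say $x;                    # 42 (the variable is unchanged)
```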
=head2 Container Types
A variable may itself be bound to a container type that specifies how the
container works, without specifying what kinds of things it contains.
# $x is implemented by the MyScalar class
my $x is MyScalar;
Constraints and container types can be used together:
# $x can contain only Int objects,
# and is implemented by the MyScalar class
my Int $x is MyScalar;
Note that C<$x> is also initialized to the C<Int> type object. See below
for more on this.
=head2 Nil
There is a special value named C<Nil>. It means "there is no value here".
It is a little bit like the empty C<()> list, insofar as both represent an
absence of values, except that C<()> is defined and means "there are 0
arguments here if you're counting that low". The C<Nil> value represents
the absence of a value where there I<should> be one, so it does not
disappear in list context, but relies on something downstream to catch it or
blow up. C<Nil> also indicates a failed match.
Since method calls are performed directly on any object, C<Nil> can respond
to certain method calls. C<Nil.defined> returns C<False> (whereas
C<().defined> returns C<True>). C<Nil.so> also returns C<False>.
C<Nil.ACCEPTS> always returns C<Nil>. C<Nil.perl> and C<Nil.gist> return
C<'Nil'>. C<Nil.Stringy> and C<Nil.Str> throw a resumable warning that
returns a value of C<''> on resumption. C<Nil.Numeric> likewise throws a
resumable warning that returns 0 on resumption. Any undefined method call
on C<Nil> returns C<Nil>, so that C<Nil> propagates down method call chains.
Likewise any subscripting operation on C<Nil> returns C<Nil>.
Any attempt to change the C<Nil> value should cause an exception to be
thrown.
Assigning C<Nil> to any scalar container causes the container to throw out
any contents and restore itself to an uninitialized state (after which it
will appear to contain an object appropriate to the declared default of the
container, where C<Any> is the default default; the element may be simply
deleted if that's how the default can be represented in the structure).
Binding of C<Nil> with C<:=> simply puts C<Nil> in the container. However,
binding C<Nil> to a parameter (C<::=> semantics) works more like assignment;
passing C<Nil> to a parameter with a default causes that parameter to be set
to its default value rather than an undefined value, as if the argument had
not been supplied.
Assigning C<Nil> to any entire composite container (such as an C<Array> or
C<Hash>) empties the container, resetting it back to an uninitialized state.
The container object itself then becomes undefined. (Assignment of C<()>
leaves it defined.)
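A few of the behaviors described in this section can be checked directly. The sketch covers only the method-call and scalar-assignment rules; parameter binding and assignment to composite containers are left out because implementations may differ there. The method names C<no-such-method> and C<another> are deliberately undefined.

```raku
say Nil.defined;          # False
say ().defined;           # True
say Nil.so;               # False
say Nil.gist;             # Nil

# An undefined method call on Nil returns Nil, so Nil propagates down a chain.
say Nil.no-such-method.another === Nil;   # True

# Assigning Nil resets a scalar to its default, here the Int type object.
my Int $count = 3;
$count = Nil;
say $count.defined;       # False
say $count === Int;       # True
```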
=head2 Type Objects
C<my Dog $spot> by itself does not automatically call a C<Dog> constructor.
It merely assigns an undefined C<Dog> prototype object to C<$spot>:
my Dog $spot; # $spot is initialized with ::Dog
my Dog $spot = Dog; # same thing
$spot.defined; # False
say $spot; # "Dog()"
Any type name used as a value is the undefined prototype object of that
type, or I<type object> for short. See S12 for more on that.
Any type name in rvalue context is parsed as a single type value and expects
no arguments following it. However, a type object responds to the function
call interface, so you may use the name of a type with parentheses as if it
were a function, and any argument supplied to the call is coerced to the
type indicated by the type object. If there is no argument in the
parentheses, the type object returns itself:
my $type = Num; # type object as a value
my $num = $type($string); # coerce to Num
To get a real C<Dog> object, call a constructor method such as C<new>:
my Dog $spot .= new;
my Dog $spot = $spot.new; # .= is rewritten into this
You can pass in arguments to the constructor as well:
my Dog $cerberus .= new(heads => 3);
my Dog $cerberus = $cerberus.new(heads => 3); # same thing
Just like L</Nil>, type objects do not disappear in list context, but rely
on something downstream to catch them or blow up. This allows type objects to
be assigned to scalars, but to disappear in non-scalar contexts.
=head2 Coercive type declarations
The parenthesized form of type coercion may be used in declarations where it
makes sense to accept a wider set of types but coerce them to a narrow type.
(This only works for one-way coercion, so you may not declare any C<rw>
parameter with a coercive type.) The type outside the parens indicates the
desired end result, and subsequent code may depend on it being that type.
The type inside the parens indicates the acceptable set of types that are
allowed to be bound or assigned to this location via coercion. If the wide
type is omitted, C<Any> is assumed. In any case, the wide type is only
indicative of permission to coerce; there must still be an available
coercion routine from the wide type to the narrow type to actually perform
the coercion.
sub foo (Str(Any) $y) {...}
sub foo (Str() $y) {...} # same thing
my Num(Cool) $x = prompt "Gimme a number";
Coercions may also be specified on the return type:
sub bar ($x, $y --> Int()) { return 3.5 } # returns 3
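The parameter form can be tried out as follows. The routine names C<shout> and C<next-number> are invented for the example, and only parameter coercion is shown; coercive variable declarations and return types are assumed to be less widely implemented.

```raku
# The wide type defaults to Any when the parentheses are empty.
sub shout(Str() $text) { $text.uc }
say shout(42);                  # 42 (the Int was coerced to the Str "42")
say shout(42).WHAT === Str;     # True

# Only Cool values are accepted here, and each is coerced to Int.
sub next-number(Int(Cool) $n) { $n + 1 }
say next-number("41");          # 42
```

In both routines the body can rely on the narrow type, which is the point of declaring the coercion in the signature instead of converting by hand.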
=head2 Containers of Native Types
If you say
my int @array is MyArray;
you are declaring that the elements of C<@array> are native integers, but
that the array itself is implemented by the C<MyArray> class. Untyped
arrays and hashes are still perfectly acceptable, but have the same
performance issues they have in Perl 5.
=head2 Methods on Arrays
To get the number of elements in an array, use the C<.elems> method. You
can also ask for the total string length of an array's elements, in
codepoints or graphemes, using the C<.codes> or C<.chars> method,
respectively, on the array. The same methods apply to strings as well.
(Note that C<.codes> is not well-defined unless you know which
canonicalization is in effect. Hence, it allows an optional argument to
specify the meaning exactly if it cannot be known from context.)
There is no C<.length> method for either arrays or strings, because
C<length> does not specify a unit.
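The difference between the units shows up as soon as a string contains a combining character. In this sketch the variable names are invented, and only the string forms of C<.chars> and C<.codes> are shown; the array forms are omitted because implementations may not agree on them.

```raku
my @pets = <cat dog budgie>;
say @pets.elems;          # 3

# "x" followed by U+0301 COMBINING ACUTE ACCENT: one grapheme, two codepoints.
my $accented = "x\x[301]";
say $accented.chars;      # 1
say $accented.codes;      # 2
```

The pair "x" plus acute accent is used because Unicode has no precomposed character for it, so the codepoint count stays at two under composing normalization.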
=head2 Built-in Type Conventions
Built-in object types start with an uppercase letter. This includes
immutable types (e.g. C<Int>, C<Num>, C<Complex>, C<Rat>, C<Str>, C<Bit>,
C<Regex>, C<Set>, C<Block>, C<Iterator>), as well as mutable (container)
types, such as C<Scalar>, C<Array>, C<Hash>, C<Buf>, C<Routine>, C<Module>,
and non-instantiable Roles such as C<Callable> and C<Integral>.
Non-object (native) types are lowercase: C<int>, C<num>, C<complex>, C<rat>,
C<buf>, C<bit>. Native types are primarily intended for declaring compact
array storage, that is, a sequence of storage locations of the specified
type laid out in memory contiguously without pointer indirection. However,
Perl will try to make those look like their corresponding uppercase types if
you treat them that way. (In other words, it does autoboxing and
autounboxing as necessary. Note, however, that repeated autoboxing and
unboxing can make your program much slower, compared to a program that makes
consistent use of either native types or object types.)
=head3 The C<.WHICH> Method for Value Types
Some object types can behave as value types. Every object can produce a
"WHICH" value that uniquely identifies the object for hashing and other
value-based comparisons. Normal objects use some kind of unique ID as their
identity, but if a class wishes to behave as a value type, it can define a
C<.WHICH> method that makes different objects look like the same object if
they happen to have the same contents.
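A minimal sketch of a class opting into value semantics. The C<Point> class is invented for the example, and C<ValueObjAt> is assumed to be available as the implementation's C<ObjAt> subclass for value types, as it is in Rakudo; the identity string format is likewise an arbitrary choice.

```raku
class Point {
    has $.x;
    has $.y;
    # Identity is derived from the contents instead of a unique ID.
    method WHICH() { ValueObjAt.new("Point|$!x|$!y") }
}

my $a = Point.new(x => 1, y => 2);
my $b = Point.new(x => 1, y => 2);
say $a === $b;            # True
say set($a, $b).elems;    # 1
```

Since such an identity is only meaningful while the contents stay fixed, a class written this way should keep its attributes immutable.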
=head3 The C<ObjAt> Type
When we say that a normal object uses its location as its identity, we do
I<not> mean that it returns its address as a number. In the first place,
not all objects are in the same memory space (see the literature on NUMA,
for instance), and two objects should not accidentally have the same
identity merely because they were stored at the same offset in two different
memory spaces. We also do not want to allow accidental identity collisions
with values that really are numbers (or strings, or any other mundane value
type). Nor should we be encouraging people to think of object locations
that way in any case. So C<WHICH> still returns a value rather than another
object, but that value must be of a special C<ObjAt> type that prevents
accidental confusion with normal value types, and at least discourages
trivial pointer arithmetic.
Certainly, it is difficult to give a unique name to every possible address
space, let alone every possible address within every such space. In the
absence of a universal naming scheme, it can only be made improbable that
two addresses from two different spaces will collide. A sufficiently large
random number may represent the current address space on output of an
C<ObjAt> to a different address space, or if serialized to YAML or XML.
(This extra identity component need not be output for debugging messages
that assume the current address space, since it will be the same big number
consistently, unless your process really is running under a NUMA.)
Alternately, if an object is being serialized to a form that does not
preserve object identity, there is no requirement to preserve uniqueness,
since in this case the object is really being translated to a value type
representation, and reconstituted on the other end as a different unique
object.
=head2 Variables Containing Undefined Values
A variable with a non-native type constraint may contain an I<undefined>
value such as a type object, provided the undefined value meets the type
constraint.
my Int $x = Int; # works
my Buf $x = Buf8; # works
Variables with native types do not support undefinedness: it is an error to
assign an undefined value to them:
my int $y = Int; # dies
Since C<num> can support the value C<NaN> but not the general concept of
undefinedness, you can coerce an undefined value like this:
my num $n = computation() // NaN;
Variables of non-native types start out containing a type object of the
appropriate type unless explicitly initialized to a defined value.
Any container's default may be overridden by the C<is default(VALUE)> trait.
If the container's contents are deleted, the value is notionally set to the
provided default value; this value may or may not be physically represented
in memory, depending on the implementation of the container. You should
officially not care about that (much).
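The defaulting rules above can be sketched as follows. This is only an illustration, not normative text: it assumes a current Rakudo-style implementation in which assigning C<Nil> is one way to clear a container's contents, and the names C<$count>, C<$limit>, and the stand-in C<computation> routine are hypothetical.

```raku
# A typed, uninitialized variable holds the type object of its constraint.
my Int $count;
say $count.defined;          # False
say $count === Int;          # True

# "is default" overrides what the container holds when it has no value.
my Int $limit is default(42);
say $limit;                  # 42
$limit = 7;
$limit = Nil;                # assumption: assigning Nil clears the container
say $limit;                  # 42

# A native num cannot be undefined, so an undefined result is mapped to NaN.
sub computation() { Num }    # hypothetical stand-in returning an undefined value
my num $n = computation() // NaN;
say $n;                      # NaN
```

Whether the default is physically stored or merely produced on demand is not observable from code like this, which is the point of the "should not care" remark.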
=head2 The C<HOW> Method
Every object supports a C<HOW> function/method that returns the metaclass
instance managing it, regardless of whether the object is defined:

    'x'.HOW.methods('x');   # get available methods for strings
    Str.HOW.methods(Str);   # same thing with the prototype object Str
    HOW(Str).methods(Str);  # same thing as function call

    'x'.methods;            # this is likely an error - not a meta object
    Str.methods;            # same thing

(For a prototype system (a non-class-based object system), all objects are
merely managed by the same meta object.)
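As a rough sketch of the calls described above, the following shows the metaobject being reached from both an instance and a type object. It assumes a Rakudo-style metamodel in which the metaobject offers C<name> and C<can> in addition to C<methods>, and in which C<.^foo> is shorthand for calling C<foo> on the C<HOW> with the invocant passed along; neither assumption is stated in this section.

```raku
my $s = 'x';

# The metaclass instance is reached the same way from an instance
# or from the type object.
say $s.HOW.name($s);                   # Str
say Str.HOW.name(Str);                 # Str

# Ask the metaobject, not the string itself, which methods exist.
say Str.HOW.methods(Str).elems > 0;    # True

# Assumed shorthand: .^name is .HOW.name(invocant), and so on.
say Str.^name;                         # Str
say so Str.^can('chars');              # True
```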
=head2 Roles
Perl supports generic types through what are called "roles" which represent
capabilities or interfaces. These roles are generally not used directly as
object types. For instance all the numeric types perform the C<Numeric>
role, and all string types perform the C<Stringy> role, but there's no such
thing as a "Numeric" object, since these are generic types that must be
instantiated with extra arguments to produce normal object types. Common
roles include:

    Stringy
    Numeric
    Real
    Integral
    Rational
    Callable
    Positional
    Associative
    Buf
    Blob

=head2 C<Numeric> Types
Perl 6 intrinsically supports big integers and rationals through its system
of type declarations. C<Int> automatically supports promotion to arbitrary
precision, as well as holding C<Inf> and C<NaN> values. Note that C<Int>
assumes 2's complement arithmetic, so C<+^1 == -2> is guaranteed. (Native
C<int> operations need not support this on machines that are not natively
2's complement. You must convert to and from C<Int> to do portable bitops
on such ancient hardware.)
C<Num> must support the largest native floating point format that runs at
full speed. It may be bound to an arbitrary precision type, but by default
it is the same type as a native C<num>. See below.
C<Rat> supports extended precision rational arithmetic. Dividing two
C<Integral> objects using C<< infix:</> >> produces a C<Rat>, which is
generally usable anywhere a C<Num> is usable, but may also be explicitly
cast to C<Num>. (Also, if either side is C<Num> already, C<< infix:</> >>
gives you a C<Num> instead of a C<Rat>.)
C<Rat> and C<Num> both do the C<Real> role.
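The relationships between C<Int>, C<Rat>, and C<Num> described above can be observed directly. This is a minimal sketch that assumes a current Rakudo-style implementation; C<.^name> is used only to display the type of each result and is not part of the rules being illustrated.

```raku
# Int promotes to arbitrary precision instead of overflowing.
say 2 ** 100;               # 1267650600228229401496703205376
say (2 ** 100).^name;       # Int

# Two's complement semantics for bitwise negation on Int.
say +^1;                    # -2

# Dividing two integers gives a Rat; a Num on either side gives a Num.
say (1 / 3).^name;          # Rat
say (1 / 3e0).^name;        # Num
say (1 / 3).Num.^name;      # Num (explicit conversion)

# Both Rat and Num do the Real role.
say (1 / 3) ~~ Real;        # True
say 3e0 ~~ Real;            # True
```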
Lowercase types like C<int> and C<num> imply the native machine
representation for integers and floating-point numbers, respectively, and do
not promote to arbitrary precision, though larger representations are always
allowed for temporary values. Unless qualified with a number of bits,
C<int> and C<num> types represent the largest native integer and
floating-point types that run at full speed.
Because temporary values are biased in favor of correct semantics over
compact storage, native numeric operators that might overflow must come in
two variants: one that returns a guaranteed correct boxed value, and one
that returns a guaranteed fast native value. By default the boxing variant
is selected (probably by virtue of hiding the native variants), but within a
given lexical scope, the C<use native> pragma will allow use of the
dangerous but fast variants instead. Arguments to the pragma can be more
specific about what types of return values are allowed, e.g. C<use native
'int';> and such. (The optimizer is also allowed to substitute such
variants when it can determine that the final destination would store
natively in any case, or that the variant could not possibly malfunction
given the arguments.) [Conjecture: we could allow an 'N' metaoperator to
select the native variant on a case by case basis.]
Numeric values in untyped variables use C<Int> and C<Num> semantics rather
than C<int> and C<num>. Literals, on the other hand, may default to native
storage formats if they reasonably can. We rely on the semantics of boxing
temporary values by default (see above) to maintain correct semantics; the
optimizer is of course allowed to box or unbox a literal at compile time (or
cache a boxed/unboxed version of the value) whenever it seems appropriate.
In any case, native literals should be preferred under C<use native>
semantics.
For pragmatic reasons, C<Rat> values are guaranteed to be exact only up to a
certain point. By default, this is the precision that would be represented
by the C<Rat64> type, which is an alias for C<Rational[Int,Uint64]>, which
has a numerator of C<Int> but is limited to a denominator of C<Uint64>
(which may or may not be implemented as a native C<uint64>, since small
representations may be desirable for small denominators). A C<Rat64> that
would require more than 64 bits of storage in the denominator is
automatically converted either to a C<Num> or to a lesser-precision C<Rat>,
at the discretion of the implementation. (Native types such as C<rat64>
limit the size of both numerator and denominator, though not to the same
size. The numerator should in general be twice the size of the denominator
to support user expectations. For instance, a C<rat8> actually supports
C<Rational[int16,uint8]>, allowing numbers like C<100.01> to be represented,
and a C<rat64>, defined as C<Rational[int128,uint64]>, can hold the number
of seconds since the Big Bang with attosecond precision. Though perhaps not
with attosecond accuracy...)
The limitation on C<Rat> values is intended to be enforced only on
user-visible types. Intermediate values used in the internal calculations
of C<Rat> operators may exceed this precision, or represent negative
denominators. That is, the temporaries used in calculating the new
numerator and denominator are (at least in the abstract) of C<Int> type.
After a new numerator and denominator are determined, any sign is forced to
be represented only by the numerator. Then if the denominator exceeds the
storage size of the unsigned integer used, the fraction is reduced via GCD.
If the resulting denominator is still larger than the storage size, then and
I<only> then may the precision be reduced to fit into a C<Rat> or C<Num>.
C<Rat> addition and subtraction should attempt to preserve the denominator
of the more precise argument if that denominator is an integral multiple of
the less precise denominator. That is, in practical terms, adding a column
of dollars and cents should generally end up with a result that has a
denominator of 100, even if values like 42 and 3.5 were added in. With
other operators, this guarantee cannot be made; in such cases, the user
should probably be explicitly rounding to a particular denominator anyway.
For applications that really need arbitrary precision denominators as well
as numerators at the cost of performance, C<FatRat> may be used, which is
defined as C<Rational[Int,Int]>, that is, as arbitrary precision in both
parts. There is no literal form for a C<FatRat>, so it must be constructed
using C<FatRat.new($nu,$de)>. In general, only math operators with at least
one C<FatRat> argument will return another C<FatRat>, to prevent accidental
promotion of reasonably fast C<Rat> values into arbitrarily slow C<FatRat>
values.
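The difference between the bounded C<Rat> and the unbounded C<FatRat> can be sketched like this. The example assumes Rakudo's choice among the options the text allows, namely that a denominator too large for 64 bits degrades the result to a C<Num>; another implementation could legitimately produce a lesser-precision C<Rat> instead. The accessor name C<denominator> is also a Rakudo assumption.

```raku
# A denominator that fits in 64 bits stays a Rat.
my $small = 1 / 10 ** 18;
say $small.^name;                   # Rat

# Assumed Rakudo behaviour: a denominator beyond 64 bits degrades to Num.
my $tiny = 1 / 10 ** 30;
say $tiny.^name;                    # Num

# FatRat keeps both parts at arbitrary precision.
my $fat = FatRat.new(1, 10 ** 30);
say $fat.^name;                     # FatRat
say $fat.denominator == 10 ** 30;   # True

# Only operations involving a FatRat produce a FatRat.
say ($fat + 1/3).^name;             # FatRat
say (1/3 + 1/3).^name;              # Rat
```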
Although most rational implementations normalize or "reduce" fractions to
their smallest representation immediately through a GCD algorithm, Perl
allows a rational datatype to do so lazily at need, such as whenever the
denominator would run out of precision, but avoid the overhead otherwise.
Hence, if you are adding a bunch of C<Rat>s that represent, say, dollars and
cents, the denominator may stay 100 the entire way through. The C<.nu> and
C<.de> methods will return these unreduced values. You can use
C<$rat.=norm> to normalize the fraction. (This also forces the sign on the
denominator to be positive.) The C<.perl> method will produce a decimal
number if the denominator is a power of 10, or normalizable to a power of 10
(that is, having factors of only 2 and 5 (and -1)). Otherwise it will
normalize and return a rational literal of the form C<< <-47/3> >>.
Stringifying a rational via C<.gist> or C<.Str> returns an exact decimal
number if possible, and otherwise rounds off the repeated decimal based on
the size of the denominator. For full details see the documentation of
C<Rat.gist> in S32.
C<Num.Str> and C<Num.gist> both produce valid C<Num> literals, so they must
include the C<e> for the exponential.

    say 1/5;   # 0.2 exactly
    say 1/3;   # 0.333333

    say <2/6>.perl;
               # <1/3>

    say 3.14159_26535_89793;
               # 3.141592653589793 including last digit

    say 111111111111111111111111111111111111111111111.123;
               # 111111111111111111111111111111111111111111111.123

    say 555555555555555555555555555555555555555555555/5;
               # 111111111111111111111111111111111111111111111

    say <555555555555555555555555555555555555555555555/5>.perl;
               # 111111111111111111111111111111111111111111111.0

    say 2e2;   # 200e0 or 2e2 or 200.0e0 or 2.0e2

=head2 Infinity and C<NaN>
Perl 6 by default makes standard IEEE floating point concepts visible, such
as C<Inf> (infinity) and C<NaN> (not a number). Within a lexical scope,
pragmas may specify the nature of temporary values, and how floating point
is to behave under various circumstances. All IEEE modes must be lexically
available via pragma except in cases where that would entail heroic efforts
to bypass a braindead platform.
The default floating-point modes do not throw exceptions but rather
propagate C<Inf> and C<NaN>. The boxed object types may carry more detailed
information on where overflow or underflow occurred. Numerics in Perl are
not designed to give the identical answer everywhere. They are designed to
give the typical programmer the tools to achieve a good enough answer most
of the time. (Really good programmers may occasionally do even better.)
Mostly this just involves using enough bits that the stupidities of the
algorithm don't matter much.
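The default non-throwing behaviour can be sketched as follows, assuming IEEE double-precision C<Num> values as found in a current Rakudo-style implementation. The C<isNaN> method is an assumption about that implementation and is not named in this section.

```raku
my $big = 1e308 * 10;        # exceeds the double-precision range
say $big;                    # Inf
say $big == Inf;             # True

my $nan = Inf - Inf;         # no meaningful result
say $nan;                    # NaN
say $nan == $nan;            # False: NaN never compares equal, even to itself
say $nan.isNaN;              # True

# NaN propagates through later arithmetic instead of throwing.
say ($nan + 1).isNaN;        # True
```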
=head2 Strings, the C<Str> Type
A C<Str> type is a Unicode string object. It boxes a native C<str> (the
difference being in representation; a C<Str> is a P6opaque and as such you
may mix in to it, but this is not possible with a C<str>). A C<Str> functions
at grapheme level. This means that C<.chars> should give the number of
graphemes, C<.substr> should never cut a combining character in two, and so
forth. Both C<str> and C<Str> are immutable. Their exact representation in
memory is implementation defined, so implementations are free to use ropes
or other data structures internally in order to make concatenation, substring,
and so forth cheaper.
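A small sketch of what grapheme-level behaviour means in practice. It assumes a Rakudo-style implementation and uses its C<.codes> method, which is not mentioned in this section, to count codepoints for comparison. The string is chosen so that its first grapheme has no precomposed codepoint.

```raku
# "x" followed by U+0301 COMBINING ACUTE ACCENT:
# two codepoints that form a single grapheme.
my $s = "x\x[301]yz";

say $s.chars;                # 3 (graphemes)
say $s.codes;                # 4 (codepoints)

# substr counts graphemes, so the combining mark stays with its base.
my $first = $s.substr(0, 1);
say $first.chars;            # 1
say $first.codes;            # 2
```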
Implementation note: since Perl 6 mandates that C<Str> must view graphemes
as the fundamental unit rather than codepoints, this has some implications
regarding efficient implementation. It is suggested that all graphemes be
translated on input to unique grapheme numbers and represented as integers
within some kind of uniform array for fast substr access. For those
graphemes that have a precomposed form, use of that codepoint is suggested.
(Note that this means Latin-1 can still be represented internally with 8-bit
integers.)
For graphemes that have no precomposed form, a temporary private id should
be assigned that uniquely identifies the grapheme. If such ids are assigned
consistently throughout the process, comparison of two graphemes is no more
difficult than the comparison of two integers, and comparison of base
characters no more difficult than a direct lookup into the id-to-NFD table.
Obviously, any temporary grapheme ids must be translated back to some
universal form (such as NFD) on output, and normal precomposed graphemes may
turn into either NFC or NFD forms depending on the desired output.
Maintaining a particular grapheme/id mapping over the life of the process
may have some GC implications for long-running processes, but most processes
will likely see a limited number of non-precomposed graphemes.
Code wishing to work at a codepoint level instead of a grapheme level
should use the C<Uni> type, which has subclasses representing the various
Unicode normalization forms (namely, C<NFC>, C<NFD>, C<NFKC>, and C<NFKD>).
Note that C<ord> is defined as a codepoint level operation. Even though the
C<Str> may contain synthetics internally, these should never be exposed by
C<ord>; instead, the behaviour should be as if the C<Str> had been converted
to an C<NFC> and then the first element accessed (obviously, implementations
are free to do something far more efficient).
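The codepoint-level view can be sketched as below. The example assumes a Rakudo-style implementation in which a C<Str> offers C<.NFC> and C<.NFD> conversion methods returning the corresponding C<Uni> subclasses; those method names are an assumption, not something this section defines.

```raku
my $s = "e\x[301]";          # "e" + U+0301 COMBINING ACUTE ACCENT

say $s.chars;                # 1: one grapheme at the Str level

# Codepoint-level views through Uni subclasses.
say $s.NFD.list;             # (101 769): decomposed
say $s.NFC.list;             # (233): precomposed U+00E9
say $s.NFC ~~ Uni;           # True

# ord behaves as if the string were converted to NFC
# and the first codepoint were taken.
say $s.ord;                  # 233
```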
=head2 The C<Buf> Type
A C<Buf> is a stringish view of an array of integers, and has no Unicode or
character properties without explicit conversion to some kind of C<Str>.
(The C<buf8>, C<buf16>, C<buf32>, and C<buf64> types are the native
counterparts; native buf types are required to occupy contiguous memory for
the entire buffer.) Typically a C<Buf> is an array of bytes serving as a
buffer. Bitwise operations on a C<Buf> treat the entire buffer as a single
large integer. Bitwise operations on a C<Str> generally fail unless the
C<Str> in question can provide an abstract C<Buf> interface somehow.
Coercion to C<Buf> should generally invalidate the C<Str> interface. As a
generic role C<Buf> may be instantiated as any of C<buf8>, C<buf16>, or
C<buf32> (or as any type that provides the appropriate C<Buf> interface),
but when used to create a buffer C<Buf> is punned to a class implementing
C<buf8> (actually C<Buf[uint8]>).
Unlike C<Str> types, C<Buf> types prefer to deal with integer string
positions, and map these directly to the underlying compact array as
indices. That is, these are not necessarily byte positions--an integer
position just counts over the number of underlying positions, where one
position means one cell of the underlying integer type. Builtin string
operations on C<Buf> types return integers and expect integers when dealing
with positions. As a limiting case, C<buf8> is just an old-school byte
string, and the positions are byte positions. Note, though, that if you
remap a section of C<buf32> memory to be C<buf8>, you'll have to multiply
all your positions by 4.
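The relationship between positions and bytes can be sketched as follows. This assumes a Rakudo-style implementation in which C<Buf[uint32]> can be instantiated directly and buffers offer C<.elems> and C<.bytes>; the variable names are made up for the example, and no actual memory remapping is performed.

```raku
# Buf used as a class is punned to bytes: positions are byte positions.
my $bytes = Buf.new(1, 2, 3, 255);
say $bytes.elems;            # 4
say $bytes[3];               # 255

# With 32-bit cells, a position counts cells, not bytes.
my $words = Buf[uint32].new(10, 20, 30);
say $words.elems;            # 3 positions
say $words.bytes;            # 12 bytes
say $words[2];               # 30

# Cell position 2 in the 32-bit view starts at byte position 2 * 4.
my $cell = 2;
say $cell * ($words.bytes div $words.elems);   # 8
```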
=head3 Native C<buf> Types
These native types are defined based on the C<Buf> role, parameterized by
the native integer type it is composed of:

    Name        Is really
    ====        =========
    buf1        Buf[bit]
    buf8        Buf[uint8]
    buf16       Buf[uint16]
    buf32       Buf[uint32]
    buf64       Buf[uint64]

There are no signed buf types provided as built-ins, but you may say
Buf[int8]
Buf[int16]
Buf[int32]