=encoding utf8
=head1 TITLE
Synopsis 2: Bits and Pieces
=head1 AUTHOR
Larry Wall <larry@wall.org>
=head1 VERSION
Maintainer: Larry Wall <larry@wall.org>
Date: 10 Aug 2004
Last Modified: 20 Feb 2009
Number: 2
Version: 153
This document summarizes Apocalypse 2, which covers small-scale
lexical items and typological issues. (These Synopses also contain
updates to reflect the evolving design of Perl 6 over time, unlike the
Apocalypses, which are frozen in time as "historical documents".
These updates are not marked--if a Synopsis disagrees with its
Apocalypse, assume the Synopsis is correct.)
=head1 One-pass parsing
To the extent allowed by sublanguages' parsers, Perl is parsed using a
one-pass, predictive parser. That is, lookahead of more than one
"longest token" is discouraged. The currently known exceptions to
this are where the parser must:
=over 4
=item *
Locate the end of interpolated expressions that begin with a sigil
and might or might not end with brackets.
=item *
Recognize that a reduce operator is not really beginning a C<[...]> composer.
=back
=head1 Lexical Conventions
=over 4
=item *
In the abstract, Perl is written in Unicode, and has consistent Unicode
semantics regardless of the underlying text representations. By default
Perl presents Unicode in "NFG" formation, where each grapheme counts as
one character. A grapheme is what the novice user would think of as a
character in their normal everyday life, including any diacritics.
=item *
Perl can count Unicode line and paragraph separators as line markers,
but that behavior had better be configurable so that Perl's idea of
line numbers matches what your editor thinks about Unicode lines.
=item *
Unicode horizontal whitespace is counted as whitespace, but it's better
not to use thin spaces where they will make adjoining tokens look like
a single token. On the other hand, Perl doesn't use indentation as syntax,
so you are free to use any amount of whitespace anywhere that whitespace makes sense.
Comments always count as whitespace.
=item *
For some syntactic purposes, Perl distinguishes bracketing characters
from non-bracketing. Bracketing characters are defined as any Unicode
characters with either bidirectional mirrorings or Ps/Pe properties.
In practice, you're safest using matching characters with
Ps/Pe properties. ASCII angle brackets are a notable exception,
since they're bidirectional but not in the Ps/Pe set.
Characters with no corresponding closing character do not qualify
as opening brackets. This includes the second section of the Unicode
BidiMirroring data table, as well as C<U+201A> and C<U+201E>.
If a character is already used in Ps/Pe mappings, then any entry
in BidiMirroring is ignored (both forward and backward mappings).
For any given Ps character, the next Pe codepoint (in numerical
order) is assumed to be its matching character even if that is not
what you might guess using left-right symmetry. Therefore C<U+298D>
maps to C<U+298E>, not C<U+2990>, and C<U+298F> maps to C<U+2990>,
not C<U+298E>. Neither C<U+298E> nor C<U+2990> is a valid bracket
opener, despite having reverse mappings in the BidiMirroring table.
The C<U+301D> codepoint has two closing alternatives, C<U+301E> and C<U+301F>;
Perl 6 recognizes only the one with the lower code point number, C<U+301E>,
as the closing bracket. This policy also applies to new one-to-many
mappings introduced in the future.
=back
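The Ps/Pe pairing rule above can be sketched in Python (not Perl 6) with the standard C<unicodedata> module. The names C<closing_for> and C<NO_CLOSER> are hypothetical. The sketch covers only the "next Pe codepoint" rule and the two excluded codepoints; it ignores the BidiMirroring table entirely, so it does not pair ASCII angle brackets.

```python
import sys
import unicodedata

# Ps characters that the text says have no corresponding closer.
NO_CLOSER = {"\u201A", "\u201E"}

def closing_for(opener):
    """Return the assumed closer for a Ps character, or None.

    Follows the rule in the text: the next Pe codepoint in numerical
    order is taken as the match. BidiMirroring pairs are not handled.
    """
    if opener in NO_CLOSER or unicodedata.category(opener) != "Ps":
        return None
    for cp in range(ord(opener) + 1, sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) == "Pe":
            return chr(cp)
    return None

for ch in "([{\u298D\u298F\u301D":
    print(f"U+{ord(ch):04X} -> U+{ord(closing_for(ch)):04X}")
```

Scanning forward by codepoint is what makes C<U+298D> pair with C<U+298E> rather than with its visual mirror image.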
=head1 Whitespace and Comments
=over 4
=item *
POD sections may be used reliably as multiline comments in Perl 6.
Unlike in Perl 5, POD syntax now lets you use C<=begin comment>
and C<=end comment> to delimit a POD block correctly without the need
for C<=cut>. (In fact, C<=cut> is now gone.) The format name does
not have to be C<comment> -- any unrecognized format name will do
to make it a comment. (However, bare C<=begin> and C<=end> probably
aren't good enough, because all comments in them will show up in the
formatted output.)
We have single paragraph comments with C<=for comment> as well.
That lets C<=for> keep its meaning as the equivalent of a C<=begin>
and C<=end> combined. As with C<=begin> and C<=end>, a comment started
in code reverts to code afterwards.
Since there is a newline before the first C<=>, the POD form of comment
counts as whitespace equivalent to a newline. See S26 for more on
embedded documentation.
=item *
Except within a string literal, a C<#> character always introduces a comment in
Perl 6. There are two forms of comment based on C<#>. Embedded
comments require the C<#> to be followed by one
or more opening bracketing characters.
All other uses of C<#> are interpreted as single-line comments that
work just as in Perl 5, starting with a C<#> character and
ending at the subsequent newline. They count as whitespace equivalent
to newline for purposes of separation. Unlike in Perl 5, C<#>
may I<not> be used as the delimiter in quoting constructs.
=item *
Embedded comments are supported as a variant on quoting syntax, introduced
by C<#> plus any user-selected bracket characters (as defined in
L</Lexical Conventions> above):
say #( embedded comment ) "hello, world!";
$object\#{ embedded comments }.say;
$object\ #「
embedded comments
」.say;
Brackets may be nested, following the same policy as ordinary quote brackets.
There must be no space between the C<#> and the opening bracket character.
(There may be the I<visual appearance> of space for some double-wide
characters, however, such as the corner quotes above.)
An embedded comment is not allowed as the first thing on the line.

    #sub foo                # line-end comment
    #{                      # ILLEGAL, syntax error
    #    ...
    #}

If you wish to have a comment there, you must disambiguate it to
either an embedded comment or a line-end comment. You can put a
space in front of it to make it an embedded comment:

     #sub foo               # line end comment
     #{                     # okay, comment
        ...                 # extends
     }                      # to here

Or you can follow the leading C<#> with something other than an opening
bracket to make it a line-end comment. Therefore, if you are commenting out a
block of code using the line-comment form, we recommend that you use
C<##>, or C<#> followed by some whitespace, preferably a tab to keep
any tab formatting consistent:

    ##sub foo
    ##{                     # okay
    ##   ...
    ##}

    # sub foo
    # {                     # okay
    #    ...
    # }

    #	sub foo
    #	{                   # okay
    #	    ...
    #	}

However, it's often better to use pod comments because they are
implicitly line-oriented. And if you have an intelligent syntax
highlighter that will mark pod comments in a different color, there's
less visual need for a C<#> on every line.
=item *
For all quoting constructs that use user-selected brackets, you can open
with multiple identical bracket characters, which must be closed by the
same number of closing brackets. Counting of nested brackets applies only
to pairs of brackets of the same length as the opening brackets:
say #{{
This comment contains unmatched } and { { { { (ignored)
Plus a nested {{ ... }} pair (counted)
}} q<< <<woot>> >> # says " <<woot>> "
Note however that bare circumfix or postcircumfix C<<< <<...>> >>> is
not a user-selected bracket, but the ASCII variant of the C<< «...» >>
interpolating word list. Only C<#> and the C<q>-style quoters (including
C<m>, C<s>, C<tr>, and C<rx>) enable subsequent user-selected brackets.
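The counting rule for repeated brackets can be sketched in Python (not Perl 6). The names C<scan_bracketed> and C<PAIRS> are hypothetical, and the table holds only ASCII brackets. The sketch assumes a simple left-to-right scan in which only runs of exactly the opening length change the nesting depth; how a real parser treats longer runs inside the body is not specified here.

```python
# Hypothetical opener -> closer table; a real parser would use the
# bracket rules from Lexical Conventions.
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}

def scan_bracketed(text, start):
    """Scan a bracketed body whose first opener is at text[start].

    Returns (body, index just past the closing brackets).
    """
    opener = text[start]
    closer = PAIRS[opener]
    n = 1
    while start + n < len(text) and text[start + n] == opener:
        n += 1                      # count repeated opening brackets
    open_tok, close_tok = opener * n, closer * n
    i = body_start = start + n
    depth = 1
    while i < len(text):
        if text.startswith(close_tok, i):
            depth -= 1
            if depth == 0:
                return text[body_start:i], i + n
            i += n
        elif text.startswith(open_tok, i):
            depth += 1              # nested pair of the same length
            i += n
        else:
            i += 1                  # shorter runs are ordinary text
    raise ValueError("unterminated bracketed construct")

comment, end = scan_bracketed("#{{ a } b { {{ c }} d }} rest", 1)
quoted, _ = scan_bracketed("q<< <<woot>> >>", 1)
print(repr(comment), repr(quoted))
```

The single braces in the first body are skipped as plain text, while the inner doubled pair is counted, which is why the scan stops at the final doubled closer.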
=item *
Some languages such as C allow you to escape newline characters
to combine lines. Other languages (such as regexes) allow you to
backslash a space character for various reasons. Perl 6 generalizes
this notion to any kind of whitespace. Any contiguous whitespace
(including comments) may be hidden from the parser by prefixing it
with C<\>. This is known as the "unspace". An unspace can suppress
any of several whitespace dependencies in Perl. For example, since
Perl requires an absence of whitespace between a noun and a postfix
operator, using unspace lets you line up postfix operators:
%hash\ {$key}
@array\ [$ix]
$subref\($arg)
As a special case to support the use above, a backslash where
a postfix is expected is considered a degenerate form of unspace.
Note that whitespace is not allowed before that, hence
$subref \($arg)
is a syntax error (two terms in a row). And
foo \($arg)
will be parsed as a list operator with a C<Capture> argument:
foo(\($arg))
However, other forms of unspace may usefully be preceded by whitespace.
(Unary uses of backslash may therefore never be followed by whitespace
or they would be taken as an unspace.)
Other postfix operators may also make use of unspace:
$number\ ++;
$number\ --;
1+3\ i;
$object\ .say();
$object\#{ your ad here }.say
Another common use of unspace is to put
a dotted postfix on the next line:
$object\ # comment
.say
$object\#[ comment
].say
$object\
.say
But unspace is mainly about language extensibility: it lets you continue
the line in any situation where a newline might confuse the parser,
regardless of your currently installed parser. (Unless, of course,
you override the unspace rule itself...)
Although we say that the unspace hides the whitespace from the parser,
it does not hide whitespace from the lexer. As a result, unspace is not
allowed within a token. Additionally, line numbers are still
counted if the unspace contains one or more newlines. A C<#> following
such a newline is always an end-of-line comment, as described above.
Since Pod chunks count as whitespace to the language, they are also
swallowed up by unspace. Heredoc boundaries are suppressed, however,
so you can split excessively long heredoc intro lines like this:
ok(q:to'CODE', q:to'OUTPUT', \
"Here is a long description", \ # --more--
todo(:parrøt<0.42>, :dötnet<1.2>));
...
CODE
...
OUTPUT
To the heredoc parser that just looks like:
ok(q:to'CODE', q:to'OUTPUT', "Here is a long description", todo(:parrøt<0.42>, :dötnet<1.2>));
...
CODE
...
OUTPUT
Note that this is one of those cases in which it is fine to have
whitespace before the unspace, since we're only trying to suppress
the newline transition, not all whitespace as in the case of postfix
parsing. (Note also that the example above is not meant to spec how
the test suite works. :)
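As a rough illustration of what the heredoc parser sees, the following Python (not Perl 6) sketch removes unspace from a line of source. The names C<UNSPACE> and C<strip_unspace> are hypothetical. The sketch makes strong simplifying assumptions: it handles only a backslash followed by whitespace and line-end comments, it does not parse embedded comments or Pod blocks, and it does not know about string literals, so a backslash-space inside a quoted string would be stripped as well.

```python
import re

# Backslash, then one or more runs of whitespace or line-end comments.
# A '#' directly followed by an opening bracket is left alone, since
# that would begin an embedded comment, which this sketch does not parse.
UNSPACE = re.compile(r"\\(?:\s+|#(?![(\[{<])[^\n]*)+")

def strip_unspace(source):
    """Remove each unspace so the remaining text reads as one line."""
    return UNSPACE.sub("", source)

intro = (
    "ok(q:to'CODE', q:to'OUTPUT', \\\n"
    '    "Here is a long description", \\ # --more--\n'
    "    todo(:parrot<0.42>));"
)
print(strip_unspace(intro))
```

Whitespace before the backslash survives, so the joined line keeps its ordinary spacing between arguments. A unary backslash such as the one in C<\($arg)> is untouched because no whitespace follows it.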
=item *
An unspace may contain a comment, but a comment may not contain an unspace.
In particular, end-of-line comments do not treat backslash as significant.
If you say:
#\ (...
it is an end-of-line comment, not an embedded comment. Write:
\ #(
...
)
to mean the other thing.
=item *
In general, whitespace is optional in Perl 6 except where it is needed
to separate constructs that would be misconstrued as a single token or
other syntactic unit. (In other words, Perl 6 follows the standard
I<longest-token> principle, or in the cases of large constructs, a
I<prefer shifting to reducing> principle. See L</Grammatical Categories>
below for more on how a Perl program is analyzed into tokens.)
This is an unchanging deep rule, but the surface ramifications of it
change as various operators and macros are added to or removed from
the language, which we expect to happen because Perl 6 is designed to
be a mutable language. In particular, there is a natural conflict
between postfix operators and infix operators, either of which
may occur after a term. If a given token may be interpreted as
either a postfix operator or an infix operator, the infix operator
requires space before it. Postfix operators may never have intervening
space, though they may have an intervening dot. If further separation
is desired, an unspace or embedded comment may be used as described above, as long
as no whitespace occurs outside the unspace or embedded comment.
For instance, if you were to add your own C<< infix:<++> >> operator,
then it must have space before it. The normal autoincrementing
C<< postfix:<++> >> operator may never have space before it, but may
be written in any of these forms:
$x++
$x\++
$x.++
$x\ ++
$x\ .++
$x\#( comment ).++
$x\#((( comment ))).++
$x\
.++
$x\ # comment
# inside unspace
.++
$x\ # comment
# inside unspace
++ # (but without the optional postfix dot)
$x\#『 comment
more comment
』.++
$x\#[ comment 1
comment 2
=begin podstuff
whatever (pod comments ignore current parser state)
=end podstuff
comment 3
].++
A consequence of the postfix rule is that (except when delimiting a
quote or terminating an unspace) a dot with whitespace in front
of it is always considered a method call on C<$_> where a term is
expected. If a term is not expected at this point, it is a syntax
error. (Unless, of course, there is an infix operator of that name
beginning with dot. You could, for instance, define a Fortranly
C<< infix:<.EQ.> >> if the fit took you. But you'll have to be sure to
always put whitespace in front of it, or it would be interpreted as
a postfix method call instead.)
For example,
foo .method
and
foo
.method
will always be interpreted as
foo $_.method
but never as
foo.method
Use some variant of
foo\
.method
if you mean the postfix method call.
One consequence of all this is that you may no longer write a Num as
C<42.> with just a trailing dot. You must instead say either C<42>
or C<42.0>. In other words, a dot following a number can only be a
decimal point if the following character is a digit. Otherwise the
postfix dot will be taken to be the start of some kind of method call
syntax. (The C<.123> form with a leading
dot is still allowed however when a term is expected, and is equivalent
to C<0.123> rather than C<$_.123>.)
=back
=head1 Built-In Data Types
=over 4
=item *
In support of OO encapsulation, there is a new fundamental datatype:
B<P6opaque>. External access to opaque objects is always through method
calls, even for attributes.
=item *
Perl 6 has an optional type system that helps you write safer
code that performs better. The compiler is free to infer what type
information it can from the types you supply, but will not complain
about missing type information unless you ask it to.
=item *
Types are officially compared using name equivalence rather than
structural equivalence. However, we're rather liberal in what we
consider a name. For example, the name includes the version and
authority associated with the module defining the type (even if
the type itself is "anonymous"). Beyond that, when you instantiate
a parametric type, the arguments are considered part of the "long
name" of the resulting type, so one C<Array of Int> is equivalent to
another C<Array of Int>. (Another way to look at it is that the type
instantiation "factory" is memoized.) Typename aliases are considered
equivalent to the original type.
This name equivalence of parametric types extends only to parameters
that can be considered immutable (or that at least can have an
immutable snapshot taken of them). Two distinct classes are never
considered equivalent even if they have the same attributes because
classes are not considered immutable.
=item *
Perl 6 supports the notion of B<properties> on various kinds of
objects. Properties are like object attributes, except that they're
managed by the individual object rather than by the object's class.
According to S12, properties are actually implemented by a
kind of mixin mechanism, and such mixins are accomplished by the
generation of an individual anonymous class for the object (unless
an identical anonymous class already exists and can safely be shared).
=item *
Properties applied to objects constructed at compile-time, such as
variables and classes, are also called B<traits>. Traits cannot be
changed at run-time. Changes to run-time properties are done via
mixin instead, so that the compiler can optimize based on declared traits.
=item *
Perl 6 is an OO engine, but you're not generally required to think
in OO when that's inconvenient. However, some built-in concepts such
as filehandles will be more object-oriented in a user-visible way
than in Perl 5.
=item *
A variable's type is a constraint indicating what sorts of values the
variable may contain. More precisely, it's a promise that the object
or objects contained in the variable are capable of responding to the
methods of the indicated "role". See S12 for more about roles.
# $x can contain only Int objects
my Int $x;
A variable may itself be bound to a container type that specifies how
the container works, without specifying what kinds of things it contains.
# $x is implemented by the MyScalar class
my $x is MyScalar;
Constraints and container types can be used together:
# $x can contain only Int objects,
# and is implemented by the MyScalar class
my Int $x is MyScalar;
Note that C<$x> is also initialized to the C<Int> protoobject. See below for more on this.
=item *
C<my Dog $spot> by itself does not automatically call a C<Dog> constructor.
It merely assigns an undefined C<Dog> prototype object to C<$spot>:
my Dog $spot; # $spot is initialized with ::Dog
my Dog $spot = Dog; # same thing
$spot.defined; # False
say $spot; # "Dog"
Any type name used as a value is an undefined instance of
that type's prototype object, or I<protoobject>. See S12 for more on that.
Any type name in rvalue context is parsed as a single protoobject value and
expects no arguments following it. However, a protoobject responds to the function
call interface, so you may use the name of a protoobject with parentheses as if it
were a function, and any argument supplied to the call is coerced
to the type indicated by the protoobject. If there is no argument
in the parentheses, the protoobject returns itself:
my $type = Num; # protoobject as a value
$num = $type($string) # coerce to Num
To get a real C<Dog> object, call a constructor method such as C<new>:
my Dog $spot .= new;
my Dog $spot = $spot.new; # .= is rewritten into this
You can pass in arguments to the constructor as well:
my Dog $cerberus .= new(heads => 3);
my Dog $cerberus = $cerberus.new(heads => 3); # same thing
=item *
If you say
my int @array is MyArray;
you are declaring that the elements of C<@array> are native integers,
but that the array itself is implemented by the C<MyArray> class.
Untyped arrays and hashes are still perfectly acceptable, but have
the same performance issues they have in Perl 5.
=item *
To get the number of elements in an array, use the C<.elems> method. You can
also ask for the total string length of an array's elements in bytes,
codepoints, or graphemes, using the C<.bytes>, C<.codes>, or C<.graphs>
methods respectively. The same methods apply to strings as well.
(Note that C<.bytes> is not guaranteed to be well-defined when the encoding
is unknown. Similarly, C<.codes> is not well-defined unless you know which
canonicalization is in effect. Hence, both methods allow an optional argument
to specify the meaning exactly if it cannot be known from context.)
There is no C<.length> method for either arrays or strings, because C<length>
does not specify a unit.
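The reason a bare "length" is ambiguous can be shown with a short Python (not Perl 6) sketch. The function C<count_units> is hypothetical and is not the Perl 6 API. It assumes a simplified notion of grapheme (a base character plus any following combining marks), whereas full segmentation follows the Unicode text segmentation rules.

```python
import unicodedata

def count_units(text, form="NFC", encoding="utf-8"):
    """Return the length of text in bytes, codepoints, and graphemes."""
    normalized = unicodedata.normalize(form, text)
    # Simplification: a grapheme is a base character plus any following
    # combining marks. Full segmentation follows Unicode UAX #29.
    graphs = sum(
        1 for i, ch in enumerate(normalized)
        if i == 0 or unicodedata.combining(ch) == 0
    )
    return {
        "bytes": len(normalized.encode(encoding)),
        "codes": len(normalized),
        "graphs": graphs,
    }

word = "re\u0301sume\u0301"          # "resume" with two combining accents
print(count_units(word, form="NFC"))
print(count_units(word, form="NFD"))
```

The grapheme count is the same under both normalization forms, while the codepoint and byte counts change, which is why those two measures take an optional argument to pin down their meaning.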
=item *
Built-in object types start with an uppercase letter. This includes
immutable types (e.g. C<Int>, C<Num>, C<Complex>, C<Rat>, C<Str>,
C<Bit>, C<Regex>, C<Set>, C<Junction>, C<Code>, C<Block>, C<List>,
C<Seq>), as well as mutable (container) types, such as C<Scalar>,
C<Array>, C<Hash>, C<Buf>, C<Routine>, C<Module>, etc.
Non-object (native) types are lowercase: C<int>, C<num>, C<complex>,
C<rat>, C<buf>, C<bit>. Native types are primarily intended for
declaring compact array storage. However, Perl will try to make those
look like their corresponding uppercase types if you treat them that way.
(In other words, it does autoboxing. Note, however, that sometimes
repeated autoboxing can slow your program more than the native type
can speed it up.)
Some object types can behave as value types. Every object can produce
a "WHICH" value that uniquely identifies the
object for hashing and other value-based comparisons. Normal objects
just use their address in memory, but if a class wishes to behave as a
value type, it can define a C<.WHICH> method that makes different objects
look like the same object if they happen to have the same contents.
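A loose analogy in Python (not Perl 6) may help: ordinary Python objects hash and compare by identity, and a class opts into value semantics by defining equality and hashing over its contents. The C<which> method below is a hypothetical stand-in for a user-defined C<.WHICH>; the class names are illustrative only.

```python
class Plain:
    """Ordinary object: identity decides equality and hashing."""
    def __init__(self, x, y):
        self.x, self.y = x, y

class Point:
    """Value-like object: contents decide equality and hashing."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def which(self):
        # Hypothetical stand-in for a user-defined .WHICH method.
        return ("Point", self.x, self.y)

    def __eq__(self, other):
        return isinstance(other, Point) and self.which() == other.which()

    def __hash__(self):
        return hash(self.which())

plain_seen = {Plain(1, 2), Plain(1, 2)}
point_seen = {Point(1, 2), Point(1, 2)}
print(len(plain_seen), len(point_seen))
```

The first set keeps both objects because they are distinct in memory; the second keeps one, because the two points report the same identifying value.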
=item *
Variables with non-native types can always contain I<undefined> values,
such as C<Object>, C<Whatever> and C<Failure> objects. See S04 for more
about failures (i.e. unthrown exceptions):
my Int $x = undef; # works
Variables with native types do not support undefinedness: it is an error
to assign an undefined value to them:
my int $y = undef; # dies
Conjecture: num might support the autoconversion of undef to NaN, since
the floating-point form can represent this concept. Might be better
to make that conversion optional though, so that the rocket designer
can decide whether to self-destruct immediately or shortly thereafter.
Variables of non-native types start out containing an undefined value
unless explicitly initialized to a defined value.
=item *
Every object supports a C<HOW> function/method that returns the
metaclass instance managing it, regardless of whether the object
is defined:
'x'.HOW.methods; # get available methods for strings
Str.HOW.methods; # same thing with the prototype object Str
HOW(Str).methods; # same thing as function call
'x'.methods; # this is likely an error - not a meta object
Str.methods; # same thing
(For a prototype system (a non-class-based object system), all objects are merely managed by the same meta object.)
=item *
Perl 6 intrinsically supports big integers and rationals through its
system of type declarations. C<Int> automatically supports promotion
to arbitrary precision, as well as holding C<Inf> and C<NaN> values.
Note that C<Int> assumes 2's complement arithmetic, so C<+^1 == -2>
is guaranteed. (Native C<int> operations need not support this on
machines that are not natively 2's complement. You must convert to
and from C<Int> to do portable bitops on such ancient hardware.)
(C<Num> may support arbitrary-precision floating-point arithmetic, but
is not required to unless we can do so portably and efficiently. C<Num>
must support the largest native floating point format that runs at full speed.)
C<Rat> supports arbitrary precision rational arithmetic. However,
dividing two C<Int> objects using C<< infix:</> >> produces a
fraction of C<Num> type, not a ratio. You can produce a ratio by
using C<< infix:<div> >> on two integers instead.
Lower-case types like C<int> and C<num> imply the native
machine representation for integers and floating-point numbers,
respectively, and do not promote to arbitrary precision, though
larger representations are always allowed for temporary values.
Unless qualified with a number of bits, C<int> and C<num> types represent
the largest native integer and floating-point types that run at full speed.
Numeric values in untyped variables use C<Int> and C<Num> semantics
rather than C<int> and C<num>.
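The distinctions in this item have close analogues in Python (not Perl 6), which may make them easier to picture. Python's C<int> stands in for C<Int>, C<float> for C<Num>, and C<fractions.Fraction> for C<Rat>; this mapping is an assumption made for illustration and says nothing about how a Perl 6 implementation represents these types.

```python
from fractions import Fraction

# Python's int, like Int in the text, promotes to arbitrary precision
# and behaves as two's complement under bitwise negation.
assert ~1 == -2
big = 2 ** 200 + 1
assert big - 2 ** 200 == 1           # no overflow or rounding

# Floating-point arithmetic on fractions accumulates rounding error ...
approx = 1 / 10 + 2 / 10
# ... while an explicit rational keeps every ratio exact.
exact = Fraction(1, 10) + Fraction(2, 10)

print(approx == 3 / 10)              # floating point: rounding error
print(exact == Fraction(3, 10))      # rational: exact
```

This is the practical difference between asking for a floating-point fraction and asking for a ratio.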
=item *
Perl 6 should by default make standard IEEE floating point concepts
visible, such as C<Inf> (infinity) and C<NaN> (not a number). Within a
lexical scope, pragmas may specify the nature of temporary values,
and how floating point is to behave under various circumstances.
All IEEE modes must be lexically available via pragma except in cases
where that would entail heroic efforts to bypass a braindead platform.
The default floating-point modes do not throw exceptions but rather
propagate Inf and NaN. The boxed object types may carry more detailed
information on where overflow or underflow occurred. Numerics in Perl
are not designed to give the identical answer everywhere. They are
designed to give the typical programmer the tools to achieve a good
enough answer most of the time. (Really good programmers may occasionally
do even better.) Mostly this just involves using enough bits that the
stupidities of the algorithm don't matter much.
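The default behavior described here, propagating C<Inf> and C<NaN> rather than throwing, can be observed in Python (not Perl 6) floating-point arithmetic. This is only an illustration of the IEEE values themselves. Python is not a full model of the proposed defaults: for instance, it raises an exception on floating-point division by zero.

```python
import math

inf = math.inf
nan = inf - inf                      # an invalid operation yields NaN

assert math.isnan(nan)
assert math.isnan(nan * 2 + 1)       # NaN propagates through arithmetic
assert nan != nan                    # NaN compares unequal to itself
assert inf + 1 == inf                # Inf absorbs finite values
assert 1e308 * 10 == inf             # overflow saturates to Inf
```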
=item *
A C<Str> is a Unicode string object. There is no corresponding native
C<str> type. However, since a C<Str> object may fill multiple roles,
we say that a C<Str> keeps track of its minimum and maximum Unicode
abstraction levels, and plays along nicely with the current lexical
scope's idea of the ideal character, whether that is bytes, codepoints,
graphemes, or characters in some language. For all builtin operations,
all C<Str> positions are reported as position objects, not integers.
These C<StrPos> objects point into a particular string at a particular
location independent of abstraction level, either by tracking the
string and position directly, or by generating an abstraction-level
independent representation of the offset from the beginning of the
string that will give the same results if applied to the same string
in any context. This is assuming the string isn't modified in the
meanwhile; a C<StrPos> is not a "marker" and is not required to follow
changes to a mutable string. For instance, if you ask for the positions
of matches done by a substitution, the answers are reported in terms of the
original string (which may now be inaccessible!), not as positions within
the modified string.
The subtraction of two C<StrPos> objects gives a C<StrLen> object,
which is also not an integer, because the string between two positions
also has multiple integer interpretations depending on the units.
A given C<StrLen> may know that it represents 18 bytes, 7 codepoints,
3 graphemes, and 1 letter in Malayalam, but it might only know this
lazily because it actually just hangs onto the two C<StrPos> endpoints
within the string that in turn may or may not just lazily point into
the string. (The lazy implementation of C<StrLen> is much like a
C<Range> object in that respect.)
If you use integers as arguments where position objects are expected,
it will be assumed that you mean the units of the current lexically
scoped Unicode abstraction level. (Which defaults to graphemes.)
Otherwise you'll need to coerce to the proper units:
substr($string, Bytes(42), ArabicChars(1))
Of course, such a dimensional number will fail if used on a string
that doesn't provide the appropriate abstraction level.
If a C<StrPos> or C<StrLen> is forced into a numeric context, it will
assume the units of the current Unicode abstraction level. It is
erroneous to pass such a non-dimensional number to a routine that
would interpret it with the wrong units.
Implementation note: since Perl 6 mandates that the default Unicode
processing level must view graphemes as the fundamental unit rather
than codepoints, this has some implications regarding efficient
implementation. It is suggested that all graphemes be translated on
input to unique grapheme numbers and represented as integers within
some kind of uniform array for fast substr access. For those graphemes
that have a precomposed form, use of that codepoint is suggested.
(Note that this means Latin-1 can still be represented internally
with 8-bit integers.)
For graphemes that have no precomposed form, a temporary private
id should be assigned that uniquely identifies the grapheme.
If such ids are assigned consistently throughout the process,
comparison of two graphemes is no more difficult than the comparison
of two integers, and comparison of base characters no more difficult
than a direct lookup into the id-to-NFD table.
Obviously, any temporary grapheme ids must be translated back to
some universal form (such as NFD) on output, and normal precomposed
graphemes may turn into either NFC or NFD forms depending on the
desired output. Maintaining a particular grapheme/id mapping over the
life of the process may have some GC implications for long-running
processes, but most processes will likely see a limited number of
non-precomposed graphemes.
If the program has a scope that wants a codepoint view rather than
a grapheme view, the string visible to that lexical scope must also
be translated to universal form, just as with output translation.
Alternately, the temporary grapheme ids may be hidden behind an
abstraction layer. In any case, codepoint scope should never see
any temporary grapheme ids. (The lexical codepoint declaration
should probably specify which normalization form it prefers to
view strings under. Such a declaration could be applied to input
translation as well.)
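The implementation note can be sketched in Python (not Perl 6) as follows. Everything here is an assumption made for illustration: the class name C<GraphemeTable>, the use of negative integers as temporary private ids, and the simplified clustering rule (a base character plus following combining marks) in place of full Unicode grapheme segmentation.

```python
import unicodedata

def _clusters(nfd_text):
    """Split NFD text into base-plus-combining-marks clusters."""
    cluster = ""
    for ch in nfd_text:
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster

class GraphemeTable:
    """Maps graphemes to integers, one integer per grapheme."""

    def __init__(self):
        self._id_for = {}            # NFD cluster -> temporary id
        self._nfd_for = {}           # temporary id -> NFD cluster

    def encode(self, text):
        ids = []
        for cluster in _clusters(unicodedata.normalize("NFD", text)):
            composed = unicodedata.normalize("NFC", cluster)
            if len(composed) == 1:
                ids.append(ord(composed))        # precomposed codepoint
            else:
                if cluster not in self._id_for:
                    new_id = -(len(self._id_for) + 1)
                    self._id_for[cluster] = new_id
                    self._nfd_for[new_id] = cluster
                ids.append(self._id_for[cluster])
        return ids

    def decode(self, ids, form="NFC"):
        """Translate ids back to a universal form for output."""
        text = "".join(chr(i) if i >= 0 else self._nfd_for[i] for i in ids)
        return unicodedata.normalize(form, text)

table = GraphemeTable()
ids = table.encode("e\u0301q\u0303q\u0303")
print(ids, len(ids))
```

Because the same cluster always receives the same id within one table, comparing two graphemes reduces to comparing two integers, and no temporary id ever escapes through C<decode>.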
=item *
A C<Buf> is a stringish view of an array of
integers, and has no Unicode or character properties without explicit
conversion to some kind of C<Str>. (A C<buf> is the native counterpart.)
Typically it's an array of bytes serving as a buffer. Bitwise
operations on a C<Buf> treat the entire buffer as a single large
integer. Bitwise operations on a C<Str> generally fail unless the
C<Str> in question can provide an abstract C<Buf> interface somehow.
Coercion to C<Buf> should generally invalidate the C<Str> interface.
As a generic type C<Buf> may be instantiated as (or bound to) any
of C<buf8>, C<buf16>, or C<buf32> (or to any type that provides the
appropriate C<Buf> interface), but when used to create a buffer C<Buf>
defaults to C<buf8>.
Unlike C<Str> types, C<Buf> types prefer to deal with integer string
positions, and map these directly to the underlying compact array
as indices. That is, these are not necessarily byte positions--an
integer position just counts over the number of underlying positions,
where one position means one cell of the underlying integer type.
Builtin string operations on C<Buf> types return integers and expect
integers when dealing with positions. As a limiting case, C<buf8> is
just an old-school byte string, and the positions are byte positions.
Note, though, that if you remap a section of C<buf32> memory to be
C<buf8>, you'll have to multiply all your positions by 4.
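The position arithmetic can be illustrated with a small Python model. This is a sketch under stated assumptions: Python's C<array> type with typecode C<"I"> stands in for a C<buf32> (its cell width is 4 bytes on common platforms but is read from C<itemsize> rather than assumed), and C<cell_via_bytes> is a hypothetical helper name.

```python
import sys
from array import array

cells = array("I", [10, 20, 30, 40])     # stand-in for a buf32-like buffer
width = cells.itemsize                   # bytes per cell (4 on common platforms)
as_bytes = memoryview(cells).cast("B")   # the same memory remapped as bytes


def cell_via_bytes(byte_view, pos, width):
    """Read cell number `pos` through the byte view."""
    start = pos * width                  # scale the cell position to a byte position
    return int.from_bytes(byte_view[start:start + width], sys.byteorder)
```

Position 2 in the cell view and position C<2 * width> in the byte view name the same memory, which is the multiplication the text refers to.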
=item *
The C<*> character as a standalone term captures the notion of
"Whatever", which is applied lazily by whatever operator it is an
argument to. Generally it can just be thought of as a "glob" that
gives you everything it can in that argument position. For instance:

    if $x ~~ 1..* {...}                 # if 1 <= $x <= +Inf
    my ($a,$b,$c) = "foo" xx *;         # an arbitrarily long list of "foo"
    if /foo/ ff * {...}                 # a latching flipflop
    @slice = @x[*;0;*];                 # any Int
    @slice = %x{*;'foo'};               # any keys in domain of 1st dimension
    @array[*]                           # flattens, unlike @array[]
    (*, *, $x) = (1, 2, 3);             # skip first two elements
                                        # (same as lvalue "undef" in Perl 5)

C<Whatever> is an undefined prototype object derived from C<Any>. As a
type it is abstract, and may not be instantiated as a defined object.
If for a particular MMD dispatch, nothing in the MMD system claims it,
it dispatches as an C<Any> with an undefined value, and usually
blows up constructively. If you say

    say 1 + *;

you should probably not expect it to yield a reasonable answer, unless
you think an exception is reasonable. Since the C<Whatever> object
is effectively immutable, the optimizer is free to recognize C<*>
and optimize in the context of what operator it is being passed to.
Most of the built-in numeric operators treat an argument of C<*> as
indicating the desire to create a function of a single unknown, so:

    * - 1

produces a function of a single argument:

    { $_ - 1 }

Likewise, the single dispatcher recognizes C<*.meth> and returns C<{ $_.meth }>,
so it can be used where patterns are expected:

    @primes = grep *.prime, 2..*;

These closures are of type C<Code:($)>, not C<Whatever>, so that constructs can distinguish
via multiple dispatch:

    1,2,3 ... *
    1,2,3 ... *+1

The bare C<*> form may also be called as a function, and represents the identity function:

    *(42) == 42
    (* + 1)(42) == 43

But note that this is I<not> what is happening above, or

    1,2,3 ... *

would end up meaning:

    1,2,3,3,3,3,3,3...

The C<...> operator is instead dispatching bare C<*> to a routine that
does dwimmery, and in this case decides to supply a function { * + 1 }.
The final element of an array is subscripted as C<@a[*-1]>,
which means that when the subscripting operation discovers a C<Code>
object for a subscript, it calls it and supplies an argument indicating
the number of elements in (that dimension of) the array. See S09.
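The behavior described so far can be imitated outside Perl 6 to make the dispatch explicit. The following Python sketch is purely illustrative: the C<Whatever> class, the C<STAR> object, and the C<subscript> function are hypothetical names, only C<+> and C<-> are modeled, and real multiple dispatch is reduced to a pair of type checks.

```python
class Whatever:
    """Hypothetical stand-in for the bare * term."""

    def __sub__(self, other):
        return lambda x: x - other   # models  * - 1  becoming  { $_ - 1 }

    def __add__(self, other):
        return lambda x: x + other   # models  * + 1  becoming  { $_ + 1 }

    def __call__(self, x):
        return x                     # bare * called as a function: identity


STAR = Whatever()


def subscript(items, index):
    """Index `items`, interpreting the index the way the subscripter would."""
    if isinstance(index, Whatever):
        return list(items)           # bare * selects everything
    if callable(index):
        index = index(len(items))    # a closure is handed the element count
    return items[index]
```

Here C<subscript(items, STAR - 1)> reaches the final element because the closure built by C<STAR - 1> is called with the number of elements, while a bare C<STAR> is recognized before any call takes place.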
A variant of C<*> is the C<**> term, which is of type C<HyperWhatever>.
It is generally understood to be a multidimensional form of C<*> when
that makes sense. When modified by an operator that would turn C<*>
into a function of one argument, C<**> instead turns into a function
with a slurpy argument, of type C<Code:(*@)>. That is:

    * - 1    means                -> $x { $x - 1 }
    ** - 1   means   -> *@x { map -> $x { $x - 1 }, @x }

Therefore C<@array[^**]> represents C<< @array[{ map { ^* }, @_ }] >>,
that is to say, every element of the array, no matter how many dimensions.
(However, C<@array[**]> means the same thing because (as with C<...>
above), the subscript operator will interpret bare C<**> as meaning
all the subscripts, not the list of dimension sizes. The meaning of
C<Whatever> is always controlled by its immediate context.)
Other uses for C<*> and C<**> will doubtless suggest themselves
over time. These can be given meaning via the MMD system, if not
the compiler. In general a C<Whatever> should be interpreted as
maximizing the degrees of freedom in a dwimmy way, not as a nihilistic
"don't care anymore--just shoot me".
=back
=head2 Native types
Values with these types autobox to their uppercase counterparts when
you treat them as objects:

    bit         single native bit
    int         native signed integer
    uint        native unsigned integer (autoboxes to Int)
    buf         native buffer (finite seq of native ints or uints, no Unicode)
    num         native floating point
    complex     native complex number
    bool        native boolean

Since native types cannot represent Perl's concept of undefined values,
in the absence of explicit initialization, native floating-point types
default to NaN, while integer types (including C<bit>) default to 0.
The complex type defaults to NaN + NaN.i. A buf type of known size
defaults to a sequence of 0 values. If any native type is explicitly
initialized to C<*> (the C<Whatever> type), no initialization is attempted
and you'll get whatever was already there when the memory was allocated.
If a buf type is initialized with a Unicode string value, the string
is decomposed into Unicode codepoints, and each codepoint shoved into
an integer element. If the size of the buf type is not specified,
it takes its length from the initializing string. If the size
is specified, the initializing string is truncated or 0-padded as
necessary. If a codepoint doesn't fit into a buf's integer type,
a parse error is issued if this can be detected at compile time;
otherwise a warning is issued at run time and the overflowed buffer
element is filled with an appropriate replacement character, either
C<U+FFFD> (REPLACEMENT CHARACTER) if the element's integer type is at
least 16 bits, or C<U+007f> (DELETE) if the larger value would not fit.
If any other conversion is desired, it must be specified explicitly.
In particular, no conversion to UTF-8 or UTF-16 is attempted; that
must be specified explicitly. (As it happens, conversion to a buf
type based on 32-bit integers produces valid UTF-32 in the native
endianness.)
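The run-time half of these rules is easy to model. The Python sketch below is an illustration only: C<buf_from_str> and its parameters are hypothetical names, a plain list of integers stands in for the buf, and the compile-time parse error and run-time warning are left out so that only the truncation, padding, and replacement behavior is shown.

```python
def buf_from_str(text, bits=8, size=None):
    """Fill a list of integer cells from the codepoints of `text`."""
    limit = 1 << bits
    # Replacement as described above: U+FFFD if the cell type has at
    # least 16 bits, otherwise U+007F.
    replacement = 0xFFFD if bits >= 16 else 0x7F
    cells = [cp if cp < limit else replacement for cp in map(ord, text)]
    if size is not None:
        # Truncate or 0-pad to the declared size.
        cells = cells[:size] + [0] * (size - len(cells))
    return cells
```

With 32-bit cells no codepoint overflows, so the cells are simply the codepoints of the string, which is why the result is valid UTF-32.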
=head2 Undefined types
These can behave as values or objects of any class, except that
C<defined> always returns false. One can create them with the
built-in C<undef> and C<fail> functions. (See S04 for how failures
are handled.)

    Nil         Empty list viewed as an item
    Object      Uninitialized (derivatives serve as protoobjects of classes)
    Whatever    Wildcard (like undef, but subject to do-what-I-mean via MMD)
    Failure     Failure (lazy exceptions, thrown if not handled properly)

Whenever you declare any kind of type, class, module, or package, you're
automatically declaring an undefined prototype value with the same name.
Whenever a C<Failure> value is put into a typed container, it takes
on the type specified by the container but continues to carry the
C<Failure> role. (The C<undef> function merely returns the most
generic C<Failure> object. Use C<fail> to return more specific failures. Use
C<Object> for the most generic non-failure undefined value. The C<Any>
type is also undefined, but excludes C<Junctions> so that autothreading
may be dispatched using normal multiple dispatch rules.)
The C<Nil> type is officially undefined as an item but interpolates
as a null list into list context, and an empty capture into slice
context. A C<Nil> object may also carry failure information,
but if so, the object behaves as a failure only in item context.
Use C<Failure>/C<undef> when you want to return a hard failure that
will not evaporate in list context.
=head2 Immutable types
Objects with these types behave like values, i.e. C<$x === $y> is true
if and only if their types and contents are identical (that is, if
C<$x.WHICH> eqv C<$y.WHICH>).

    Bit         Perl single bit (allows traits, aliasing, undef, etc.)
    Int         Perl integer (allows Inf/NaN, arbitrary precision, etc.)
    Str         Perl string (finite sequence of Unicode characters)
    Num         Perl number
    Rat         Perl rational
    Complex     Perl complex number
    Bool        Perl boolean
    Exception   Perl exception
    Code        Base class for all executable objects
    Block       Executable objects that have lexical scopes
    List        Lazy Perl list (composed of immutables and iterators)
    Seq         Completely evaluated (hence immutable) sequence
    Range       A pair of Ordered endpoints; gens immutables when iterated
    Set         Unordered collection of values that allows no duplicates
    Bag         Unordered collection of values that allows duplicates
    Junction    Set with additional behaviors
    Signature   Function parameters (left-hand side of a binding)
    Capture     Function call arguments (right-hand side of a binding)
    Blob        An undifferentiated mass of bits
    Instant     A point on the continuous atomic timeline (TAI)
    Duration    The difference between two Instants

Insofar as Lists are lazy, they're really only partially immutable, in
the sense that the past is fixed but the future is not. The portion of
a List yet to be determined by iterators may depend on mutable values.
When an iterator is called upon to iterate and extend the known part
of the list, some number of immutable values (which includes immutable
references to mutable objects) are decided and locked in at that point.
Iterators may have several different ways of iterating depending on
the degree of laziness/eagerness desired in context. The iterator
API is described in S07.
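The idea that the past is fixed but the future is not can be shown with a toy model. The Python sketch below is not the iterator API of S07: C<LazyList>, C<counter>, and C<state> are hypothetical names, only non-negative integer indexing is supported, and a Python generator plays the role of the iterator.

```python
class LazyList:
    """Toy lazy list: a fixed, known prefix plus an iterator for the rest."""

    def __init__(self, iterable):
        self._source = iter(iterable)
        self._known = []                  # the fixed "past" of the list

    def __getitem__(self, index):
        # Extend the known part only as far as the request requires.
        while len(self._known) <= index:
            try:
                self._known.append(next(self._source))
            except StopIteration:
                raise IndexError(index) from None
        return self._known[index]

    def known(self):
        """The elements decided and locked in so far."""
        return tuple(self._known)


state = {"step": 1}                       # a mutable value the future depends on


def counter():
    n = 0
    while True:
        yield n
        n += state["step"]


lazy = LazyList(counter())
lazy[2]                                   # locks in 0, 1, 2
state["step"] = 10                        # affects only elements not yet decided
lazy[3]                                   # locks in 12
```

After the change to C<state>, the first three elements stay as they were, and only the element produced afterwards reflects the new step.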
C<Instant>s and C<Duration>s are measured in atomic seconds with
fractions. Notionally they are real numbers which may be implemented
in either C<Num> or C<Rat> types. (Fixed-point implementations are
strongly discouraged.) Interfaces that take C<Duration> arguments,
such as sleep(), may also take C<Num> arguments, but C<Instant>
arguments must be explicitly created via any of various culturally
aware time specification APIs that, by and large, are outside the
CORE of Perl 6, with the possible exception of a constructor taking a
native TAI value. In numeric context a C<Duration> happily returns a
C<Num> representing seconds. If pressed for a number, an C<Instant>
will return the length of time in atomic seconds from the TAI epoch,
but it will be unhappy about it. Systems which cannot provide
a steady time base, such as POSIX systems, will simply have to make