@@ -35,21 +35,21 @@ Raku will turn both these inputs into one codepoint, as is specified for
35
35
Normalization Form C (B<X<NFC|Reference,NFC>>). In most cases this is useful and means
36
36
that two inputs that are equivalent are both treated the same. Unicode has a concept
37
37
of canonical equivalence which allows us to determine the canonical form of a string,
38
- allowing us to properly compare strings and manipulate them, without having to worry
38
+ thus allowing us to properly compare strings and manipulate them without having to worry
39
39
about the text losing these properties. By default, any text you process or output
40
40
from Raku will be in this “canonical” form, even when making modifications or
41
41
concatenations to the string (see below for how to avoid this). For more detailed information
42
42
about Normalization Form C and canonical equivalence, see the Unicode Foundation's page on
43
43
L<Normalization and Canonical Equivalence|https://unicode.org/reports/tr15/#Canon_Compat_Equivalence>.
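As a minimal sketch of the behavior described above, the two spellings of "é" below (one precomposed codepoint, and a base letter followed by a combining accent) should compare as equal once Raku has normalized them. The variable names are illustrative only and do not come from the original text.

```raku
# Two canonically equivalent spellings of the same character.
my $precomposed = "\c[LATIN SMALL LETTER E WITH ACUTE]";
my $decomposed  = "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]";

# Both strings are stored in normalized form, so they compare as equal.
say $precomposed eq $decomposed;   # OUTPUT: «True»

# The decomposed input has been composed into a single codepoint (NFC).
say $decomposed.codes;             # OUTPUT: «1»

# Asking explicitly for Normalization Form D splits it up again.
say $decomposed.NFD.codes;         # OUTPUT: «2»
```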
44
44
45
- One case where we don't default to this, is for the names of files. This is because
45
+ One case where we don't default to this is for the names of files. This is because
46
46
the names of files must be accessed exactly as the bytes are written on the disk.
47
47
48
48
To avoid normalization you can use a special encoding format called L<UTF8-C8|#UTF8-C8>.
49
49
Using this encoding with any filehandle will allow you to read the exact bytes as they are
50
- on disk, without normalization. They may look funny when printed out, if you print it out using a
50
+ on disk without normalization. The text may look funny when printed out if you use a
51
51
UTF8 handle. If you print it out to a handle where the output encoding is UTF8-C8,
52
- then it will render as you would normally expect, and be a byte for byte exact
52
+ then it will render as you would normally expect, as a byte-for-byte exact
53
53
copy. More technical details on L<UTF8-C8|#UTF8-C8> on MoarVM are described below.
54
54
55
55
=head2 X<UTF8-C8|Reference,UTF8-C8>
@@ -60,7 +60,7 @@ UTF-8, or that would not round-trip due to normalization, it will use
60
60
L<NFG synthetics|/language/glossary#NFG>
61
61
to keep track of the original bytes involved.
62
62
This means that encoding back to UTF-8 Clean-8 will be able to recreate the
63
- bytes as they originally existed. The synthetics contain 4 codepoints:
63
+ bytes as they originally existed. The synthetics contain four codepoints:
64
64
65
65
=item The codepoint 0x10FFFD (which is a private use codepoint)
66
66
=item The codepoint 'x'
@@ -71,13 +71,13 @@ Under normal UTF-8 encoding, this means the unrepresentable characters will
71
71
come out as something like C<?xFF>.
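The round trip described above can be sketched as follows. This is an illustration under the assumption that both decoding and encoding are done with C<utf8-c8>. The variable names are hypothetical.

```raku
# 0xFE can never appear in well-formed UTF-8, so this buffer is not valid UTF-8.
my $original = Buf.new(ord('A'), 0xFE, ord('Z'));

# Decoding with utf8-c8 keeps track of the invalid byte instead of failing.
my $decoded = $original.decode('utf8-c8');

# Encoding with the same encoding is expected to recreate the original bytes.
my $restored = $decoded.encode('utf8-c8');

# Compare the byte values; the two buffers have different types,
# so compare their contents rather than the objects themselves.
say $restored.list.join(' ') eq $original.list.join(' ');
```

If the bytes survive the round trip as described, the last line prints C<True>.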
72
72
73
73
UTF-8 Clean-8 is used in places where MoarVM receives strings from the
74
- environment, command line arguments, and filesystem queries, for instance when decoding buffers:
74
+ environment, command line arguments, and filesystem queries; for instance, when decoding buffers:
75
75
76
76
say Buf.new(ord('A'), 0xFE, ord('Z')).decode('utf8-c8');
77
77
# OUTPUT: «AxFEZ»
78
78
79
- You can see how the two initial codepoints used by UTF8-C8 show up here, right
80
- before the "FE" . You can use this type of encoding to read files with unknown
79
+ You can see how the two initial codepoints used by UTF8-C8 show up here, right
80
+ before the 'FE'. You can use this type of encoding to read files with unknown
81
81
encoding:
82
82
83
83
my $test-file = "/tmp/test";
@@ -126,13 +126,13 @@ the L<uniparse|/routine/uniparse>:
126
126
127
127
See L<uniname|/routine/uniname> and L<uninames|/routine/uninames> for routines
128
128
that work in the opposite direction with a single codepoint and multiple
129
- codepoints respectively.
129
+ codepoints, respectively.
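A short sketch of the three routines side by side, using plain ASCII letters so the expected names are easy to check:

```raku
# Name to character.
say "LATIN SMALL LETTER A".uniparse;   # OUTPUT: «a»

# Character to name, for a single codepoint.
say "a".uniname;                       # OUTPUT: «LATIN SMALL LETTER A»

# Characters to names, one name per codepoint in the string.
say "ab".uninames;                     # OUTPUT: «(LATIN SMALL LETTER A LATIN SMALL LETTER B)»
```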
130
130
131
131
=head2 Name aliases
132
132
133
133
Name Aliases are used mainly for codepoints without an official
134
134
name, for abbreviations, or for corrections (Unicode names never change).
135
- For full list of them see L<here|https://www.unicode.org/Public/UCD/latest/ucd/NameAliases.txt>.
135
+ For a full list of them see L<here|https://www.unicode.org/Public/UCD/latest/ucd/NameAliases.txt>.
136
136
137
137
Control codes without any official name:
138
138
@@ -141,16 +141,22 @@ Control codes without any official name:
141
141
142
142
Corrections:
143
143
144
- say "\c[LATIN CAPITAL LETTER GHA]"; # OUTPUT: «Ƣ»
144
+ # Correct name as input:
145
+ say "\c[LATIN CAPITAL LETTER GHA]"; # OUTPUT: «Ƣ»
146
+ # Original, erroneous name as output:
145
147
say "Ƣ".uniname; # OUTPUT: «LATIN CAPITAL LETTER OI»
148
+
146
149
# This one is a spelling mistake that was corrected in a Name Alias:
147
- say "\c[PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET]".uniname;
150
+ # Correct name as input:
151
+ say "\c[PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET]".uniname;
152
+ # Original, erroneous name as output:
148
153
# OUTPUT: «PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET»
149
154
150
155
Abbreviations:
151
156
152
- say "\c[ZWJ]".uniname; # OUTPUT: «ZERO WIDTH JOINER»
153
- say "\c[NBSP]".uniname; # OUTPUT: «NO-BREAK SPACE»
157
+ say "\c[ZWJ]".uniname; # OUTPUT: «ZERO WIDTH JOINER»
158
+ say "\c[NBSP]".uniname; # OUTPUT: «NO-BREAK SPACE»
159
+ say "\c[NNBSP]".uniname; # OUTPUT: «NARROW NO-BREAK SPACE»
154
160
155
161
=head2 Named sequences
156
162
@@ -166,7 +172,7 @@ Raku supports Emoji sequences.
166
172
For all of them see:
167
173
L<Emoji ZWJ Sequences|https://www.unicode.org/Public/emoji/4.0/emoji-zwj-sequences.txt>
168
174
and L<Emoji Sequences|https://www.unicode.org/Public/emoji/4.0/emoji-sequences.txt>.
169
- Note that any names with commas should have their commas removed, since Raku uses
175
+ Note that any names with commas should have their commas removed since Raku uses
170
176
commas to separate different codepoints/sequences inside the same C<\c> sequence.
171
177
172
178
say "\c[woman gesturing OK]"; # OUTPUT: «🙆♀️»
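To illustrate why the commas have to go, here is a minimal sketch of a comma acting as a separator inside a single C<\c> sequence. It uses two ordinary codepoint names rather than an emoji sequence, and the variable name is illustrative only.

```raku
# The comma separates two distinct codepoint names inside one \c sequence.
my $pair = "\c[LATIN SMALL LETTER A, LATIN SMALL LETTER B]";

say $pair;         # OUTPUT: «ab»
say $pair.chars;   # OUTPUT: «2»
```

A sequence name that kept its own commas would therefore be read as several separate names.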