Commit 84d30c3

Add reference to Whitespace (#4334)
+ Grammar tweaks
+ Add more examples
+ Align examples of Unicode naming corrections for clarity
  - Use comments to distinguish correct versus incorrect naming versions
1 parent a061271 commit 84d30c3

4 files changed: +38 -26 lines changed
CONTRIBUTING.md

Lines changed: 5 additions & 5 deletions
@@ -77,7 +77,7 @@ channel and/or the [issues for this repository](https://github.com/Raku/doc/issu
 before you proceed further. After you get consensus on a title, subtitle,
 section, and filename, you can add the document by following these steps:

-+ create a **filename.pod6** file in the **doc/Language** directory and
++ create a **filename.rakudoc** file in the **doc/Language** directory and
 ensure it adheres to the conventions in
 [CREATING-NEW-DOCS.md](writing-docs/CREATING-NEW-DOCS.md).

@@ -97,9 +97,9 @@ with the helper tool `util/new-type.raku`. Say you want to create `MyFunnyRole`:

 $ raku util/new-type.raku --kind=role MyFunnyRole

-Fill the documentation file `doc/Type/MyFunnyRole.pod6` like this:
+Fill the documentation file `doc/Type/MyFunnyRole.rakudoc` like this:

-```perl6
+```raku
 =TITLE role MyFunnyRole

 =SUBTITLE Sentence or half-sentence about what it does
@@ -131,7 +131,7 @@ comment `Z<>`.

 When providing a code example result or output, use this style:

-```perl6
+```raku
 # For the result of an expression.
 1 + 2; # RESULT: «3»
 # For the output.
@@ -182,7 +182,7 @@ to display heading numbers.

 Report issues with the content on [github](https://github.com/Raku/doc/issues).
 This includes missing or incorrect documentation, as well as information about
-versioning (e.g. "method foo" only available in raku v6.d).
+versioning (e.g., "method foo" only available in raku v6.d).

 For issues with the website functionality (as opposed to the content), for
 examples issues with search,

doc/Language/regexes.rakudoc

Lines changed: 2 additions & 2 deletions
@@ -738,8 +738,8 @@ write the backslashed forms for character classes between the C<[ ]>.

 You can include Unicode properties in the list as well:

-/<:Zs + [\x9] - [\xA0]>/
-# Any character with "Zs" property, or a tab, but not a "no-break space"
+/<:Zs + [\x9] - [\xA0] - [\x202F] >/
+# Any character with "Zs" property, or a tab, but not a "no-break space" or "narrow no-break space"

 To negate a character class, put a C<-> after the opening angle bracket:

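As a quick sanity check of the updated character class in the hunk above, the following sketch matches a few single characters against it. It is not part of the commit; the variable name `$space-like` and the choice of sample characters are assumptions made purely for illustration.

```raku
# Hypothetical check, not part of the commit: which characters does the
# updated class accept? `$space-like` is an illustrative name.
my $space-like = /<:Zs + [\x9] - [\xA0] - [\x202F]>/;

say so " "        ~~ $space-like;   # True:  U+0020 SPACE has the Zs property
say so "\t"       ~~ $space-like;   # True:  U+0009 is added explicitly
say so "\x[A0]"   ~~ $space-like;   # False: NO-BREAK SPACE is subtracted
say so "\x[202F]" ~~ $space-like;   # False: NARROW NO-BREAK SPACE is subtracted
```

Both subtracted codepoints carry the `Zs` property themselves, which is why they have to be removed explicitly.
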
doc/Language/unicode.rakudoc

Lines changed: 21 additions & 15 deletions
@@ -35,21 +35,21 @@ Raku will turn both these inputs into one codepoint, as is specified for
 Normalization Form C (B<X<NFC|Reference,NFC>>). In most cases this is useful and means
 that two inputs that are equivalent are both treated the same. Unicode has a concept
 of canonical equivalence which allows us to determine the canonical form of a string,
-allowing us to properly compare strings and manipulate them, without having to worry
+thus allowing us to properly compare strings and manipulate them without having to worry
 about the text losing these properties. By default, any text you process or output
 from Raku will be in this “canonical” form, even when making modifications or
 concatenations to the string (see below for how to avoid this). For more detailed information
 about Normalization Form C and canonical equivalence, see the Unicode Foundation's page on
 L<Normalization and Canonical Equivalence|https://unicode.org/reports/tr15/#Canon_Compat_Equivalence>.

-One case where we don't default to this, is for the names of files. This is because
+One case where we don't default to this is for the names of files. This is because
 the names of files must be accessed exactly as the bytes are written on the disk.

 To avoid normalization you can use a special encoding format called L<UTF8-C8|#UTF8-C8>.
 Using this encoding with any filehandle will allow you to read the exact bytes as they are
-on disk, without normalization. They may look funny when printed out, if you print it out using a
+on disk without normalization. They may look funny when printed out if you use a
 UTF8 handle. If you print it out to a handle where the output encoding is UTF8-C8,
-then it will render as you would normally expect, and be a byte for byte exact
+then it will render as you would normally expect as a byte-for-byte exact
 copy. More technical details on L<UTF8-C8|#UTF8-C8> on MoarVM are described below.

 =head2 X<UTF8-C8|Reference,UTF8-C8>
@@ -60,7 +60,7 @@ UTF-8, or that would not round-trip due to normalization, it will use
 L<NFG synthetics|/language/glossary#NFG>
 to keep track of the original bytes involved.
 This means that encoding back to UTF-8 Clean-8 will be able to recreate the
-bytes as they originally existed. The synthetics contain 4 codepoints:
+bytes as they originally existed. The synthetics contain four codepoints:

 =item The codepoint 0x10FFFD (which is a private use codepoint)
 =item The codepoint 'x'
@@ -71,13 +71,13 @@ Under normal UTF-8 encoding, this means the unrepresentable characters will
 come out as something like C<?xFF>.

 UTF-8 Clean-8 is used in places where MoarVM receives strings from the
-environment, command line arguments, and filesystem queries, for instance when decoding buffers:
+environment, command line arguments, and filesystem queries; for instance when decoding buffers:

 say Buf.new(ord('A'), 0xFE, ord('Z')).decode('utf8-c8');
 # OUTPUT: «A􏿽xFEZ␤»

-You can see how the two initial codepoints used by UTF8-C8 show up here, right
-before the "FE". You can use this type of encoding to read files with unknown
+You can see how the two initial codepoints used by UTF8-C8 show up below right
+before the 'FE'. You can use this type of encoding to read files with unknown
 encoding:

 my $test-file = "/tmp/test";
@@ -126,13 +126,13 @@ the L<uniparse|/routine/uniparse>:

 See L<uniname|/routine/uniname> and L<uninames|/routine/uninames> for routines
 that work in the opposite direction with a single codepoint and multiple
-codepoints respectively.
+codepoints, respectively.

 =head2 Name aliases

 Name Aliases are used mainly for codepoints without an official
 name, for abbreviations, or for corrections (Unicode names never change).
-For full list of them see L<here|https://www.unicode.org/Public/UCD/latest/ucd/NameAliases.txt>.
+For a full list of them see L<here|https://www.unicode.org/Public/UCD/latest/ucd/NameAliases.txt>.

 Control codes without any official name:

@@ -141,16 +141,22 @@ Control codes without any official name:

 Corrections:

-say "\c[LATIN CAPITAL LETTER GHA]"; # OUTPUT: «Ƣ␤»
+# Correct name as input:
+say "\c[LATIN CAPITAL LETTER GHA]"; # OUTPUT: «Ƣ␤»
+# Original, erroneous name as output:
 say "Ƣ".uniname; # OUTPUT: «LATIN CAPITAL LETTER OI␤»
+
 # This one is a spelling mistake that was corrected in a Name Alias:
-say "\c[PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET]".uniname;
+# Correct name as input:
+say "\c[PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET]".uniname;
+# Original, erroneous name as output:
 # OUTPUT: «PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET␤»

 Abbreviations:

-say "\c[ZWJ]".uniname; # OUTPUT: «ZERO WIDTH JOINER␤»
-say "\c[NBSP]".uniname; # OUTPUT: «NO-BREAK SPACE␤»
+say "\c[ZWJ]".uniname; # OUTPUT: «ZERO WIDTH JOINER␤»
+say "\c[NBSP]".uniname; # OUTPUT: «NO-BREAK SPACE␤»
+say "\c[NNBSP]".uniname; # OUTPUT: «NARROW NO-BREAK SPACE␤»

 =head2 Named sequences

@@ -166,7 +172,7 @@ Raku supports Emoji sequences.
 For all of them see:
 L<Emoji ZWJ Sequences|https://www.unicode.org/Public/emoji/4.0/emoji-zwj-sequences.txt>
 and L<Emoji Sequences|https://www.unicode.org/Public/emoji/4.0/emoji-sequences.txt>.
-Note that any names with commas should have their commas removed, since Raku uses
+Note that any names with commas should have their commas removed since Raku uses
 commas to separate different codepoints/sequences inside the same C<\c> sequence.

 say "\c[woman gesturing OK]"; # OUTPUT: «🙆‍♀️␤»

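The `NNBSP` abbreviation added to the Abbreviations example in this file can be related back to the codepoint that the regexes change subtracts. The sketch below is not part of the commit; the variable name `$nnbsp` is an illustrative assumption, and the snippet only uses the `\c[...]` escape together with the `ord`, `base`, and `uniprop` methods.

```raku
# Hypothetical check, not part of the commit: the abbreviation alias and
# the full name resolve to the same codepoint, U+202F.
my $nnbsp = "\c[NNBSP]";

say $nnbsp.ord.base(16);                     # 202F
say $nnbsp.uniprop;                          # Zs (Separator, space)
say $nnbsp eq "\c[NARROW NO-BREAK SPACE]";   # True
```
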
doc/Language/unicode_ascii.rakudoc

Lines changed: 10 additions & 4 deletions
@@ -31,7 +31,7 @@ Any codepoint that has the C<Nd> (Number, decimal digit) property, can
 be used as a digit in any number. For example:

 my $var = 19; # U+FF11 U+FF19
-say $var + 2; # OUTPUT: «21␤»
+say $var + 2; # OUTPUT: «21␤»

 =head1 Numeric values

@@ -40,14 +40,20 @@ property can be used standalone as a numeric value, such as ½ and ⅓. (These
 aren't decimal digit characters, so can't be combined.) For example:

 my $var = ⅒ + 2 + Ⅻ; # here ⅒ is No and Rat and Ⅻ is Nl and Int
-say $var; # OUTPUT: «14.1␤»
+say $var; # OUTPUT: «14.1␤»

-=head1 Whitespace characters
+=head1 X<Language,Whitespace> Whitespace characters

-Besides spaces and tabs you can use any other unicode whitespace
+Besides spaces and tabs, you can use any other unicode whitespace
 character that has the C<Zs> (Separator, space), C<Zl> (Separator,
 line), or C<Zp> (Separator, paragraph) property.

+See Wikipedia's L<Whitespace|https://en.m.wikipedia.org/wiki/Whitespace_character>
+section for detailed
+tables of the Unicode codepoints with (or associated with)
+whitespace characteristics. This is an important section for Raku
+authors of digital typeography modules for print or web use.
+
 =head1 Other acceptable single codepoints

 This list contains the single codepoints [and their ASCII

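To make the three properties named in the Whitespace hunk concrete, the sketch below prints the general category of one sample codepoint for each of them and then checks that `\s` matches all three. It is not part of the commit, and the choice of sample codepoints is an assumption made for illustration.

```raku
# Hypothetical check, not part of the commit: one sample codepoint per
# property named in the hunk above.
say "\x[3000]".uniprop;   # Zs (IDEOGRAPHIC SPACE)
say "\x[2028]".uniprop;   # Zl (LINE SEPARATOR)
say "\x[2029]".uniprop;   # Zp (PARAGRAPH SEPARATOR)

# All three count as whitespace for \s in a regex.
say so "\x[3000]\x[2028]\x[2029]" ~~ /^ \s+ $/;   # True
```
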