Update design docs for how Str/NFG has worked out.
Of note, the StrLen and StrPos types are gone, and Str only works at
grapheme level; as per S15 we have Uni and its subtypes for working
at codepoint level, and Buf for working at bytes level.
jnthn committed Jun 6, 2015
1 parent 415fa4c commit 6e1c9d4
Showing 5 changed files with 44 additions and 108 deletions.
94 changes: 26 additions & 68 deletions S02-bits.pod
@@ -912,66 +912,24 @@ algorithm don't matter much.

=head2 Strings, the C<Str> Type

A C<Str> is a Unicode string object. There is no corresponding native
C<str> type. However, since a C<Str> object may fill multiple roles, we say
that a C<Str> keeps track of its minimum and maximum Unicode abstraction
levels, and plays along nicely with the current lexical scope's idea of the
ideal character, whether that is bytes, codepoints, graphemes, or characters
in some language.

=head3 The C<StrPos> Type

For all builtin operations, all C<Str> positions are reported as position
objects, not integers. These C<StrPos> objects point into a particular
string at a particular location independent of abstraction level, either by
tracking the string and position directly, or by generating an
abstraction-level independent representation of the offset from the
beginning of the string that will give the same results if applied to the
same string in any context. This is assuming the string isn't modified in
the meanwhile; a C<StrPos> is not a "marker" and is not required to follow
changes to a mutable string. For instance, if you ask for the positions of
matches done by a substitution, the answers are reported in terms of the
original string (which may now be inaccessible!), not as positions within
the modified string.

=head3 The C<StrLen> Type

The subtraction of two C<StrPos> objects gives a C<StrLen> object, which is
also not an integer, because the string between two positions also has
multiple integer interpretations depending on the units. A given C<StrLen>
may know that it represents 7 codepoints, 3 graphemes, and 1 letter in
Malayalam, but it might only know this lazily because it actually just hangs
onto the two C<StrPos> endpoints within the string that in turn may or may
not just lazily point into the string. (The lazy implementation of
C<StrLen> is much like a C<Range> object in that respect.)

=head3 Units of Position Arguments

If you use integers as arguments where position objects are expected, it
will be assumed that you mean the units of the current lexically scoped
Unicode abstraction level. (Which defaults to graphemes.) Otherwise you'll
need to coerce to the proper units:

substr($string, Bytes(42), ArabicChars(1))

Of course, such a dimensional number will fail if used on a string that
doesn't provide the appropriate abstraction level.

=head3 Numeric Coercion of C<StrPos> or C<StrLen>

If a C<StrPos> or C<StrLen> is forced into a numeric context, it will assume
the units of the current Unicode abstraction level. It is erroneous to pass
such a non-dimensional number to a routine that would interpret it with the
wrong units.

Implementation note: since Perl 6 mandates that the default Unicode
processing level must view graphemes as the fundamental unit rather than
codepoints, this has some implications regarding efficient implementation.
It is suggested that all graphemes be translated on input to unique grapheme
numbers and represented as integers within some kind of uniform array for
fast substr access. For those graphemes that have a precomposed form, use
of that codepoint is suggested. (Note that this means Latin-1 can still be
represented internally with 8-bit integers.)
A C<Str> is a Unicode string object. It boxes a native C<str>; the two
differ in representation: a C<Str> is a P6opaque, so you may mix in to it,
which is not possible with a C<str>. A C<Str> functions at grapheme level.
This means that C<.chars> should give the number of graphemes, C<.substr>
should never cut a combining character in two, and so forth. Both C<str>
and C<Str> are immutable. Their exact representation in memory is
implementation-defined, so implementations are free to use ropes or other
data structures internally in order to make concatenation, substring
extraction, and so forth cheaper.
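
The grapheme-level behaviour described above can be sketched outside Perl 6. The Python below is only an illustration, not part of the specification: the helper names C<graphemes>, C<chars>, and C<substr> are hypothetical, and the clustering rule (a base codepoint plus any following combining marks) is a deliberate simplification of full Unicode segmentation.

```python
import unicodedata

def graphemes(text):
    """Approximate grapheme split: a base codepoint plus trailing combining marks.

    Simplified on purpose; real segmentation (UAX #29) also handles Hangul,
    ZWJ sequences, regional indicators, and more.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new grapheme
    return clusters

def chars(text):
    """Number of graphemes, not codepoints."""
    return len(graphemes(text))

def substr(text, start, length=None):
    """Slice by grapheme, so a combining character is never cut from its base."""
    parts = graphemes(text)
    end = len(parts) if length is None else start + length
    return "".join(parts[start:end])

s = "re\u0301sume\u0301"        # "résumé" written with decomposed accents
print(len(s))                    # 8 codepoints
print(chars(s))                  # 6 graphemes
print(substr(s, 1, 1) == "e\u0301")  # True: the accent stays with its base
```

Slicing the raw codepoint sequence at offset 2 would separate the accent from its base letter; slicing by grapheme cannot.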

Implementation note: since Perl 6 mandates that C<Str> must view graphemes
as the fundamental unit rather than codepoints, this has some implications
regarding efficient implementation. It is suggested that all graphemes be
translated on input to unique grapheme numbers and represented as integers
within some kind of uniform array for fast substr access. For those
graphemes that have a precomposed form, use of that codepoint is suggested.
(Note that this means Latin-1 can still be represented internally with 8-bit
integers.)

For graphemes that have no precomposed form, a temporary private id should
be assigned that uniquely identifies the grapheme. If such ids are assigned
@@ -986,14 +944,14 @@ Maintaining a particular grapheme/id mapping over the life of the process
may have some GC implications for long-running processes, but most processes
will likely see a limited number of non-precomposed graphemes.
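
A minimal sketch of the grapheme-number scheme suggested above, written in Python purely for illustration. The class name C<GraphemeTable> is hypothetical, and the use of negative integers for the temporary private ids is an assumption of this sketch; the text only requires that such ids uniquely identify the grapheme.

```python
import unicodedata

class GraphemeTable:
    """Maps each grapheme to one integer.

    A grapheme with a precomposed form is stored as that codepoint.
    Anything else gets a per-process private id (negative here, by assumption).
    """

    def __init__(self):
        self._ids = {}        # grapheme -> private id
        self._clusters = []   # position (-id - 1) -> grapheme

    def encode(self, cluster):
        composed = unicodedata.normalize("NFC", cluster)
        if len(composed) == 1:
            return ord(composed)          # precomposed form exists
        if composed not in self._ids:
            self._clusters.append(composed)
            self._ids[composed] = -len(self._clusters)   # -1, -2, ...
        return self._ids[composed]

    def decode(self, number):
        if number >= 0:
            return chr(number)
        return self._clusters[-number - 1]

table = GraphemeTable()
print(table.encode("e\u0301"))              # 233, the precomposed U+00E9
print(table.encode("g\u0308"))              # -1, no precomposed form exists
print(table.decode(-1) == "g\u0308")        # True
```

Because every grapheme becomes one integer, a string is a uniform array and substring access is plain indexing. Text confined to Latin-1 never needs a private id, so every value fits in 8 bits.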

If the program has a scope that wants a codepoint view rather than a
grapheme view, the string visible to that lexical scope must also be
translated to universal form, just as with output translation. Alternately,
the temporary grapheme ids may be hidden behind an abstraction layer. In
any case, codepoint scope should never see any temporary grapheme ids. (The
lexical codepoint declaration should probably specify which normalization
form it prefers to view strings under. Such a declaration could be applied
to input translation as well.)
Code wishing to work at a codepoint level instead of a grapheme level
should use the C<Uni> type, which has subclasses representing the various
Unicode normalization forms (namely, C<NFC>, C<NFD>, C<NFKC>, and C<NFKD>).
Note that C<ord> is defined as a codepoint level operation. Even though the
C<Str> may contain synthetics internally, these should never be exposed by
C<ord>; instead, the behaviour should be as if the C<Str> had been converted
to an C<NFC> and then the first element accessed (obviously, implementations
are free to do something far more efficient).
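
The C<ord> rule above can be illustrated with Python's standard C<unicodedata> module, which implements the same four normalization forms. This is a sketch only: the function name C<str_ord> is hypothetical, and it takes the naive convert-then-index route that the text says implementations need not follow.

```python
import unicodedata

def str_ord(text):
    """Codepoint-level ord: the first codepoint of the NFC form.

    Raises IndexError on an empty string; a real implementation would
    define that case explicitly.
    """
    return ord(unicodedata.normalize("NFC", text)[0])

decomposed = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT

# The same text viewed under each normalization form.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    codepoints = [hex(ord(c)) for c in unicodedata.normalize(form, decomposed)]
    print(form, codepoints)

print(hex(str_ord(decomposed)))   # 0xe9

# With no precomposed form available, NFC leaves the base letter first,
# so ord reports the base codepoint and never an internal synthetic id.
print(str_ord("g\u0308") == ord("g"))   # True
```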

=head2 The C<Buf> Type

14 changes: 7 additions & 7 deletions S03-operators.pod
@@ -4008,13 +4008,13 @@ You can search a gather like this:
$lazystr ~~ /pattern/;

The C<Cat> interface allows the regex to match element boundaries
with the C<< <,> >> assertion, and the C<StrPos> objects returned by
the match can be broken down into elements index and position within
that list element. If the underlying data structure is a mutable
array, changes to the array (such as by C<shift> or C<pop>) are tracked
by the C<Cat> so that the element numbers remain correct. Strings,
arrays, lists, sequences, captures, and tree nodes can all be pattern
matched by regexes or by signatures more or less interchangeably.
with the C<< <,> >> assertion, and the C<Match> objects provide a way
to get both the element's index and the position within that list element.
If the underlying data structure is a mutable array, changes to the array
(such as by C<shift> or C<pop>) are tracked by the C<Cat> so that the element
numbers remain correct. Strings, arrays, lists, sequences, captures, and
tree nodes can all be pattern matched by regexes or by signatures more or
less interchangeably.

=head1 Invocant marker

5 changes: 1 addition & 4 deletions S05-regex.pod
@@ -312,9 +312,6 @@ Note that this does not automatically anchor the pattern to the starting
location. (Use C<:p> for that.) The pattern you supply to C<split>
has an implicit C<:c> modifier.

String positions are of type C<StrPos> and should generally be treated
as opaque.

=item *

The C<:p> (or C<:pos>) modifier causes the pattern to try to match only at
@@ -1847,7 +1844,7 @@ The special named assertions include:
# \s+ if it's between two \w characters,
# \s* otherwise

/ <?at($pos)> / # match only at a particular StrPos
/ <?at($pos)> / # match only at a particular position
# short for <?{ .pos === $pos }>
# (considered declarative until $pos changes)

35 changes: 10 additions & 25 deletions S32-setting-library/Str.pod
@@ -190,20 +190,16 @@ C<$str.encode('ISO-8859-1')> a C<blob8>.

=item index

multi method index( Str $string: Str $substring, StrPos $pos = StrPos(0) --> StrPos ) is export
multi method index( Str $string: Str $substring, Int $pos --> StrPos ) is export
multi method index( Str $string: Str $substring, Int $pos --> Int ) is export

C<index> searches for the first occurrence of C<$substring> in C<$string>,
starting at C<$pos>. If $pos is an C<Int>, it is taken to be in the units
of the calling scope, which defaults to "graphemes".
starting at C<$pos>.

The value returned is always a C<StrPos> object. If the substring
is found, then the C<StrPos> represents the position of the first
character of the substring. If the substring is not found, a bare
C<StrPos> containing no position is returned. This prototype C<StrPos>
evaluates to false because it's really a kind of undefined value. Do not evaluate
as a number, because instead of returning -1 it will return 0 and issue
a warning.
If the substring is found, then the C<Int> returned represents the position
of the first character of the substring. If the substring is not found, the
C<Int> type object is returned. This type object evaluates to false because
it's really a kind of undefined value. Do not evaluate it as a number,
because instead of returning -1 it will return 0 and issue a warning.
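
The search semantics can be sketched in Python for illustration. Everything here is an assumption of the sketch rather than the specified API: the helper names, the default of 0 for C<pos>, the simplified grapheme rule, and the use of C<None> to stand in for the undefined C<Int> type object.

```python
import unicodedata

def graphemes(text):
    """Approximate grapheme split of the NFC form (base plus combining marks)."""
    clusters = []
    for ch in unicodedata.normalize("NFC", text):
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def index(string, substring, pos=0):
    """Grapheme position of the first match at or after pos, else None.

    Assumes pos is non-negative. None plays the role of the undefined
    return value, so a miss is never confused with position 0.
    """
    hay, needle = graphemes(string), graphemes(substring)
    for i in range(pos, len(hay) - len(needle) + 1):
        if hay[i:i + len(needle)] == needle:
            return i
    return None

s = "re\u0301sume\u0301"          # "résumé" with decomposed accents
print(index(s, "s"))               # 2
print(index(s, "\u00e9", 2))       # 5
print(index(s, "e"))               # None: a bare "e" is not a grapheme of s
```

Callers should test the result with C<is None> rather than by truthiness, since a match at position 0 is also false in Python. This mirrors the warning above about not treating the not-found value as a number.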


=item pack
@@ -245,8 +241,7 @@ asked for it with .packformat or some such. -law]
=item rindex
X<rindex>

multi method rindex( Str $string: Str $substring, StrPos $pos? --> StrPos ) is export
multi method rindex( Str $string: Str $substring, Int $pos --> StrPos ) is export
multi method rindex( Str $string: Str $substring, Int $pos --> Int ) is export

Returns the position of the last C<$substring> in C<$string>. If C<$pos>
is specified, then the search starts at that location in C<$string>, and
@@ -422,20 +417,10 @@ C<sprintf($format, $p.key, $p.value)>.

=item substr

multi method substr (Str $string: StrPos $start, StrLen $length? --> Str ) is export
multi method substr (Str $string: StrPos $start, StrPos $end --> Str ) is export
multi method substr (Str $string: StrPos $start, Int $length --> Str ) is export
multi method substr (Str $string: Int $start, StrLen $length? --> Str ) is export
multi method substr (Str $string: Int $start, StrPos $end --> Str ) is export
multi method substr (Str $string: Int $start, Int $length --> Str ) is export
multi method substr (Str $string: Int $start, Int $length? --> Str ) is export

C<substr> returns part of an existing string. You control what part by
passing a starting position and optionally either an end position or length.
If you pass a number as either the position or length, then it will be used
as the start or length with the assumption that you mean "chars" in the
current Unicode abstraction level, which defaults to graphemes. A number
in the 3rd argument is interpreted as a length rather than a position (just
as in Perl 5).
passing a starting position and optionally a length.

Here is an example of its use:

4 changes: 0 additions & 4 deletions contents.pod
@@ -42,10 +42,6 @@
C<Numeric> Types
Infinity and C<NaN>
Strings, the C<Str> Type
The C<StrPos> Type
The C<StrLen> Type
Units of Position Arguments
Numeric Coercion of C<StrPos> or C<StrLen>
The C<Buf> Type
Native C<buf> Types
The C<Whatever> Object