Conversation

JakobOvrum
Contributor

Considered in response to the Unicode handling comparison thread on the NewsGroup.

  • byGrapheme eases string manipulation on graphemes by allowing range algorithms to work with graphemes rather than code points.
  • byCodePoint is a counterpart necessary for converting the result of range-based string manipulation on graphemes back to a string. e.g. graphemes.byCodePoint.text converts a range of graphemes to a string, and graphemes.byCodePoint.array converts a range of graphemes to a dchar[]. Of course, it's also useful on its own when a range of code points is accepted, such as when doing I/O: writeln(graphemes.byCodePoint);

The code example uses array before retro because byGrapheme is only a ForwardRange. Bidirectionality is technically possible to add, but byGrapheme simply builds on existing std.uni functionality (i.e. decodeGrapheme), which does not seem to support decoding graphemes from the back of a string.
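A minimal sketch of the pattern described above, assuming the two functions land in std.uni under the names used in this pull; reverseGraphemes is a hypothetical helper name chosen for illustration only:

```d
import std.array : array;
import std.conv : text;
import std.range : retro, walkLength;
import std.uni : byCodePoint, byGrapheme;

// Hypothetical helper: reverse a string grapheme by grapheme.
string reverseGraphemes(string s)
{
    return s.byGrapheme
        .array        // byGrapheme is only a forward range, so buffer it first
        .retro        // Grapheme[] is bidirectional, so retro works here
        .byCodePoint  // flatten the graphemes back into code points
        .text;        // encode the code points as a UTF-8 string
}

void main()
{
    string s = "noe\u0308l"; // "noël" with a combining diaeresis

    assert(s.walkLength == 5);            // five code points
    assert(s.byGrapheme.walkLength == 4); // four graphemes
    assert(reverseGraphemes(s) == "le\u0308on");
}
```

The intermediate array is the cost of byGrapheme not being bidirectional; it would presumably become unnecessary if backwards decoding were added later.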

If bidirectionality is added to byGrapheme, then byCodePoint should be edited to propagate it when available.

@blackwhale (and others), please destroy.


void popFront()
{
    _front = _range.empty? Grapheme.init : _range.decodeGrapheme();
Contributor

Shouldn't _range.empty? be an assert instead?

Member

+1 just do assert(!empty) and do the decoding.

Member

Wait a sec, I see that you use Grapheme.init as the empty flag.
Personally I'd say use a boolean flag, as it should be faster even if it makes the byGrapheme range a bit bulkier.

Contributor Author

Indeed, Grapheme.init is the empty flag. The idea is to formulate it so that front and empty are as cheap as possible, with the work done in popFront.

Contributor Author

empty is now implemented with _front.length == 0.
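To make the design discussed in this thread concrete, here is a rough sketch of a range that does all decoding in popFront and keeps front and empty trivial. GraphemeRange and countGraphemes are hypothetical names for illustration; this is not necessarily the exact code in the pull:

```d
import std.range.primitives : empty;
import std.uni : Grapheme, decodeGrapheme;

// Hypothetical wrapper over any input range of dchar (including strings).
struct GraphemeRange(Range)
{
    private Range _range;
    private Grapheme _front; // Grapheme.init (length 0) doubles as the "empty" flag

    this(Range range)
    {
        _range = range;
        popFront(); // prime the first grapheme
    }

    @property bool empty() { return _front.length == 0; }

    @property Grapheme front() { return _front; }

    void popFront()
    {
        // All the work happens here; once the source is exhausted,
        // _front falls back to Grapheme.init and empty becomes true.
        _front = _range.empty ? Grapheme.init : _range.decodeGrapheme();
    }
}

// Hypothetical helper: count graphemes by walking the range.
size_t countGraphemes(string s)
{
    size_t n;
    for (auto r = GraphemeRange!string(s); !r.empty; r.popFront())
        ++n;
    return n;
}

void main()
{
    assert(countGraphemes("noe\u0308l") == 4);
    assert(countGraphemes("") == 0);
}
```

The trade-off raised above still applies: a separate boolean flag would make the range slightly larger, while checking _front.length reuses state that is already stored.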

Collaborator

space ( ) before the ? operators plz :)

@JakobOvrum
Contributor Author

@blackwhale, any word on grapheme decoding from the back of strings, à la std.utf.strideBack?

@DmitryOlshansky
Member

The word is that decodeGrapheme should work fine with retro(range).
Indeed, looking at genericDecodeGrapheme, I see little point in creating a decodeBackwards that would basically do s/front/back/, s/popFront/popBack/.

@JakobOvrum
Contributor Author

@blackwhale, the problem is that range needs to support bidirectionality for retro to work in the first place.

@DmitryOlshansky
Member

I'm not seeing that problem: if Range _range is bidirectional, then

auto r = retro(_range);
decodeGrapheme(r);

should work. That being said, I think we'd have to pay the extra cost of a second grapheme slot.

Not pretty. It would be better if retro allowed ranges to provide their own 'get-a-reversed-range' call even if they are not bidirectional.
One simple way would be to do it via a member function called retro.

@JakobOvrum
Contributor Author

@blackwhale, decodeGrapheme on a reversed range of code points works very differently from what you'd expect from decoding graphemes backwards.

With reverse grapheme decoding, one would want the reverse of "noël" to be "lëon", e.g. through the following hypothetical example:

writeln("noe\u0308l".byGrapheme.retro.byCodePoint); // Should print "lëon" in decomposed format

This is a simple consequence of the smallest unit (the element type) being a grapheme.

Compare that with using byGrapheme's forward iteration (i.e. decodeGrapheme) on a reversed range of code points:

writeln("noe\u0308l".retro.byGrapheme.map!(g => g[].array).joiner);

The above prints "l̈eon" - the combining diaeresis is attached to the "l" rather than the "e". I wouldn't expect decodeGrapheme to work in any other way, but it's no substitute for decoding graphemes backwards as illustrated by the first example.
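The difference between the two orderings can be checked with a short sketch. Because byGrapheme is not bidirectional, the first case buffers the graphemes with array before retro instead of using the hypothetical byGrapheme.retro; graphemeWise and codePointWise are illustrative names only:

```d
import std.array : array;
import std.range : retro;
import std.uni : byCodePoint, byGrapheme;

// Reverse whole graphemes: the diaeresis stays attached to the "e".
dchar[] graphemeWise(string s)
{
    return s.byGrapheme.array.retro.byCodePoint.array;
}

// Reverse code points first, then group them into graphemes:
// the diaeresis now follows the "l" and combines with it.
dchar[] codePointWise(string s)
{
    return s.retro.byGrapheme.byCodePoint.array;
}

void main()
{
    string s = "noe\u0308l";
    assert(graphemeWise(s) == "le\u0308on"d);
    assert(codePointWise(s) == "l\u0308eon"d);
}
```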

@JakobOvrum
Contributor Author

Fixed forward-range propagation.

@DmitryOlshansky
Member

@JakobOvrum Yeah, I forgot that, unlike forward decoding, which typically starts with a "starter", backward decoding would first have to take combining marks and such and then end on a "starter". That means I'd have to reverse the grapheme cluster automaton.

For now I suggest we simply postpone making it bidirectional to a separate pull.

@JakobOvrum
Contributor Author

> For now I suggest we simply postpone making it bidirectional to a separate pull.

Yeah, it would be a completely additive change and I think byGrapheme is still useful enough as it is.

@DmitryOlshansky
Member

Otherwise LGTM, paging @monarchdodra to merge as he sees fit.

@monarchdodra
Collaborator

LGTM, mostly. I'll take a day or two to fully review it.


I am looking at Grapheme's design though, and am a bit concerned with the whole being @trusted, yet seeing things like:

/++
Random-access range over Grapheme's $(CHARACTERS).

Warning: Invalidates when this Grapheme leaves the scope,
attempts to use it then would lead to memory corruption.
+/

Such design is really very dangerous, even in an @system scope. I really don't see this as anything but escaping reference to local, so in no way can it be @trusted. In particular, for something as innocuous as auto s = myGrapheme[];, which is usually expected to "just work", and the escaping reference mostly hidden. The only case where [] escapes a reference in all of D that I know of is static arrays. Even Array doesn't do it.
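One way to sidestep the hazard described above is to copy the code points out of the slice while the Grapheme is still alive. A minimal sketch, where firstGrapheme is a hypothetical helper and not part of std.uni:

```d
import std.array : array;
import std.uni : Grapheme, decodeGrapheme;

// Hypothetical helper: return the code points of the first grapheme of s.
dchar[] firstGrapheme(string s)
{
    Grapheme g = decodeGrapheme(s); // consumes the first grapheme from the local copy of s

    // g[] may refer to storage inside g itself, so returning the slice
    // directly would outlive g. Copying into a GC-allocated array is safe.
    return g[].array;
}

void main()
{
    assert(firstGrapheme("e\u0308l") == "e\u0308"d);
}
```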

Furthermore, I also have concerns about Grapheme having non-GC allocation, and a destructor at all, which basically means it is impossible to place it in a dynamic array without massive leaks. I'm pretty sure one of the first things I'd do is a quick auto myGraphemes = myString.byGrapheme().array().

Well, this is outside of the scope of this pull, but for something we "hoped" would be simple, the design of Grapheme seems... unfit.

@JakobOvrum
Contributor Author

Ternary operator whitespace is now consistent with the rest of the module.

> I am looking at Grapheme's design though, and am a bit concerned with the whole being @trusted

I think @trusted on types as well as @trusted: are bad practices. The number of @trusted functions should be minimized, and @trusted right there on the declaration serves as a warning flag to maintainers.

@DmitryOlshansky
Member

@monarchdodra TL;DR: let's consider making it @system. It shouldn't break much code.

> Such design is really very dangerous, even in an @system scope. I really don't see this as anything but escaping reference to local, so in no way can it be @trusted. In particular, for something as innocuous as auto s = myGrapheme[];, which is usually expected to "just work", and the escaping reference mostly hidden. The only case where [] escapes a reference in all of D that I know of is static arrays. Even Array doesn't do it.

I do agree with the general sentiment. However, I had very little choice here: we ABSOLUTELY do not want it to allocate for SMALL graphemes, and these are like 99.99% of cases (around 1-3 code points). This means stepping into the uncharted (for Phobos, sadly) territory of small string optimization: stack allocation for small graphemes plus deterministic destruction for long ones. Roughly speaking, Grapheme is ~ std::string of C++. There is no way to make a range over that foolproof, as it may point into the stack.

There is basically no defined stance on small string optimization, value-typed small containers in general and no defined notion of invalidation of ranges over containers (when the latter go out of scope).

> Furthermore, I also have concerns about Grapheme having non-GC allocation, and a destructor at all, which basically means it is impossible to place it in a dynamic array without massive leaks. I'm pretty sure one of the first things I'd do is a quick auto myGraphemes = myString.byGrapheme().array().

I can hardly say anything other than that it's about bloody time to fix this in the compiler/druntime. Even knowing about it, I tend to do things like arrays of RefCounted!T and whatnot, and they must call dtors on finalize. If it's of any comfort, only truly long graphemes (>5 on 32 bits) are malloced.

monarchdodra added a commit that referenced this pull request Dec 5, 2013
Add std.uni.byGrapheme and std.uni.byCodePoint
@monarchdodra monarchdodra merged commit 43901ec into dlang:master Dec 5, 2013
@JakobOvrum JakobOvrum deleted the graphemerange branch December 22, 2013 09:39

4 participants