Add std.uni.byGrapheme and std.uni.byCodePoint #1736

JakobOvrum · 2013-11-29T09:35:19Z

Considered in response to the Unicode handling comparison thread on the newsgroup.

byGrapheme eases grapheme-level string manipulation by letting range algorithms operate on graphemes rather than code points.
byCodePoint is the necessary counterpart: it converts the result of range-based manipulation on graphemes back into code points. For example, graphemes.byCodePoint.text converts a range of graphemes to a string, and graphemes.byCodePoint.array converts a range of graphemes to a dchar[]. Of course, it's also useful on its own wherever a range of code points is accepted, such as when doing I/O: writeln(graphemes.byCodePoint);

The code example uses array before retro because byGrapheme is only a ForwardRange. Bidirectionality is technically possible to add, but byGrapheme simply builds on existing std.uni functionality (i.e. decodeGrapheme), which does not seem to support decoding graphemes from the back of a string.
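
The array-before-retro pattern can be sketched roughly as follows. This is a minimal example assuming byGrapheme and byCodePoint behave as described in this pull request; reverseGraphemes is a hypothetical helper name used only for illustration.

```d
import std.array : array;
import std.conv : text;
import std.range : retro, walkLength;
import std.uni : byCodePoint, byGrapheme;

// Hypothetical helper: reverses a string grapheme by grapheme.
string reverseGraphemes(string s)
{
    return s.byGrapheme   // forward range of Grapheme
            .array        // materialize, since retro needs a bidirectional range
            .retro        // reverse the order of whole graphemes
            .byCodePoint  // flatten back into a range of dchar
            .text;        // encode as a UTF-8 string
}

void main()
{
    string s = "noe\u0308l"; // "noël" with a combining diaeresis

    assert(s.walkLength == 5);            // five code points
    assert(s.byGrapheme.walkLength == 4); // four graphemes

    assert(reverseGraphemes(s) == "le\u0308on"); // "lëon", still decomposed
}
```

The intermediate array is the price of byGrapheme being forward-only; if bidirectionality were added later, the .array step could presumably be dropped.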

If bidirectionality is added to byGrapheme, then byCodePoint should be edited to propagate it when available.

@blackwhale (and others), please destroy.

jacob-carlborg · 2013-11-29T14:03:42Z

std/uni.d

+
+        void popFront()
+        {
+            _front = _range.empty? Grapheme.init : _range.decodeGrapheme();


Shouldn't _range.empty? be an assert instead?

DmitryOlshansky · 2013-11-29

+1, just do assert(!empty) and do the decoding.

DmitryOlshansky · 2013-11-29

Wait a sec, I see that you use Grapheme.init as the empty flag.
Personally, I'd say use a boolean flag, as it should be faster even if it makes the byGrapheme range a bit bulkier.

JakobOvrum · 2013-11-29

Indeed, Grapheme.init is the empty flag. The idea is to formulate it so that front and empty are as cheap as possible, with the work done in popFront.

JakobOvrum · 2013-11-30

empty is now implemented with _front.length == 0.

monarchdodra · 2013-12-03

Space ( ) before the ? operators, plz :)
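
The design discussed in this review thread (work done in popFront, with a zero-length Grapheme acting as the empty flag) can be sketched as below. This is a simplified illustration, not the code from the pull request: GraphemeWalker and countGraphemes are hypothetical names, and the sketch is fixed to string input rather than being generic over ranges.

```d
import std.uni : decodeGrapheme, Grapheme;

// Hypothetical, simplified stand-in for the range returned by byGrapheme.
struct GraphemeWalker
{
    private string _input;
    private Grapheme _front; // Grapheme.init (length 0) doubles as the "empty" flag

    this(string input)
    {
        _input = input;
        popFront(); // decode the first grapheme eagerly
    }

    // Both primitives are cheap: no decoding happens here.
    @property bool empty() { return _front.length == 0; }
    @property Grapheme front() { return _front; }

    // All decoding work lives in popFront.
    void popFront()
    {
        _front = _input.length == 0 ? Grapheme.init : decodeGrapheme(_input);
    }
}

// Hypothetical helper that walks the range and counts its elements.
size_t countGraphemes(string s)
{
    size_t n;
    for (auto w = GraphemeWalker(s); !w.empty; w.popFront())
        ++n;
    return n;
}

void main()
{
    assert(countGraphemes("") == 0);
    assert(countGraphemes("noe\u0308l") == 4);
}
```

A separate bool flag, as suggested in the review, would avoid inspecting the Grapheme in empty at the cost of one extra field.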

JakobOvrum · 2013-11-29T23:06:46Z

@blackwhale, any word on grapheme decoding from the back of strings, à la std.utf.strideBack?

DmitryOlshansky · 2013-11-30T08:15:41Z

The word is that decodeGrapheme should work fine with retro(range).
Indeed, looking at genericDecodeGrapheme, I see little point in creating a decodeBackwards that would basically do s/front/back/, s/popFront/popBack/.

JakobOvrum · 2013-11-30T08:28:29Z

@blackwhale, the problem is that range needs to support bidirectionality for retro to work in the first place.

DmitryOlshansky · 2013-11-30T09:25:45Z

I'm not seeing that problem: if Range _range is bidirectional, then

auto r = retro(_range);
decodeGrapheme(r);

should work. That being said, I think we'd have to pay the extra cost of a second grapheme slot.

Not pretty. It would be better if retro allowed ranges to provide their own 'get-a-reversed-range' call even if they are not bidirectional.
One simple way would be to do it via a member function called retro.

JakobOvrum · 2013-12-02T07:58:05Z

@blackwhale, decodeGrapheme on a reversed range of code points works very differently from what you'd expect from decoding graphemes backwards.

With reverse grapheme decoding, one would want the reverse of "noël" to be "lëon", e.g. through the following hypothetical example:

writeln("noe\u0308l".byGrapheme.retro.byCodePoint); // Should print "lëon" in decomposed format

This is a simple consequence of the smallest unit (the element type) being a grapheme.

Compare that with using byGrapheme's forward iteration (i.e. decodeGrapheme) on a reversed range of code points:

writeln("noe\u0308l".retro.byGrapheme.map!(g => g[].array).joiner);

The above prints "l̈eon" - the combining diaeresis is attached to the "l" rather than the "e". I wouldn't expect decodeGrapheme to work in any other way, but it's no substitute for decoding graphemes backwards as illustrated by the first example.
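
The two behaviours can be compared side by side in a small sketch. Since byGrapheme is only a forward range, the hypothetical byGrapheme.retro from the first example is approximated here with .array.retro; graphemesReversed and codePointsReversed are illustrative names, not part of std.uni.

```d
import std.array : array;
import std.range : retro;
import std.uni : byCodePoint, byGrapheme;

// Reverse the order of whole graphemes (the "lëon" behaviour).
dchar[] graphemesReversed(string s)
{
    return s.byGrapheme.array.retro.byCodePoint.array;
}

// Reverse the code points first, then group them into graphemes.
// The combining mark now follows, and attaches to, a different base character.
dchar[] codePointsReversed(string s)
{
    return s.retro.byGrapheme.byCodePoint.array;
}

void main()
{
    string s = "noe\u0308l";

    assert(graphemesReversed(s) == "le\u0308on"d);  // diaeresis stays on the "e"
    assert(codePointsReversed(s) == "l\u0308eon"d); // diaeresis moves to the "l"
}
```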

JakobOvrum · 2013-12-02T08:28:08Z

Fixed forward-range propagation.

DmitryOlshansky · 2013-12-02T08:40:42Z

@JakobOvrum Yeah, I somehow forgot that, unlike forward decoding, which typically starts with a "starter", backward decoding would first have to take combining marks and such and end on a "starter". That means I'd have to reverse the grapheme cluster automaton.

For now I suggest we simply postpone making it bidirectional to a separate pull.

JakobOvrum · 2013-12-02T08:46:44Z

> For now I suggest we simply postpone making it bidirectional to a separate pull.

Yeah, it would be a completely additive change and I think byGrapheme is still useful enough as it is.

DmitryOlshansky · 2013-12-02T09:13:02Z

Otherwise it's LGTM, paging @monarchdodra to merge as he sees fit.

monarchdodra · 2013-12-03T10:58:51Z

LGTM, mostly. I'll take a day or two to fully review it.

I am looking at Grapheme's design though, and am a bit concerned with the whole being @trusted, yet seeing things like:

/++
Random-access range over Grapheme's $(CHARACTERS).

Warning: Invalidates when this Grapheme leaves the scope,
attempts to use it then would lead to memory corruption.
+/

Such design is really very dangerous, even in an @system scope. I really don't see this as anything but escaping reference to local, so in no way can it be @trusted. In particular, for something as innocuous as auto s = myGrapheme[];, which is usually expected to "just work", and the escaping reference mostly hidden. The only case where [] escapes a reference in all of D that I know of is static arrays. Even Array doesn't do it.
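
One way to sidestep the lifetime hazard described above is to copy the code points out while the Grapheme is still alive, so that nothing refers to its internal storage afterwards. The following is a minimal sketch of that idea; firstGraphemeCodePoints is a hypothetical helper, not part of std.uni.

```d
import std.array : array;
import std.uni : decodeGrapheme, Grapheme;

// Hypothetical helper: returns the code points of the first grapheme in s.
dchar[] firstGraphemeCodePoints(string s)
{
    Grapheme g = decodeGrapheme(s); // consumes the first grapheme from the local copy of s

    // g[] is a range over g's own storage, so it must not outlive g.
    // .array copies the code points into a GC-allocated dchar[] before g goes out of scope.
    return g[].array;
}

void main()
{
    assert(firstGraphemeCodePoints("e\u0308x") == "e\u0308"d);
}
```

Returning g[] itself from such a function would be the escaping-reference pattern the warning in the documentation comment is about.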

Furthermore, I'm also showing concerns about Grapheme having non-GC allocation, and a destructor at all, which basically means it is impossible to place it in a dynamic array, without massive leaks. I'm pretty sure one of the first things I'd do is a quick auto myGraphemes = myString.byGrapheme().array().

Well, this is outside of the scope of this pull, but for something we "hoped" would be simple, the design of Grapheme seems... unfit.

JakobOvrum · 2013-12-03T13:20:15Z

Ternary operator whitespace is now consistent with the rest of the module.

> I am looking at Grapheme's design though, and am a bit concerned with the whole being @trusted

I think @trusted on types as well as @trusted: are bad practices. The number of @trusted functions should be minimized, and @trusted right there on the declaration serves as a warning flag to maintainers.

DmitryOlshansky · 2013-12-03T16:46:43Z

@monarchdodra TL;DR: let's think about making it @system. It shouldn't break much code.

> Such design is really very dangerous, even in an @system scope. I really don't see this as anything but escaping reference to local, so in no way can it be @trusted. In particular, for something as innocuous as auto s = myGrapheme[];, which is usually expected to "just work", and the escaping reference mostly hidden. The only case where [] escapes a reference in all of D that I know of is static arrays. Even Array doesn't do it.

I do agree with the general sentiment. However, I had very little choice here: we ABSOLUTELY do not want it to allocate for SMALL graphemes, and these are like 99.99% of cases (around 1-3 code points). This means stepping into uncharted (for Phobos, sadly) territory: small string optimization, with stack allocation for small graphemes plus deterministic destruction for long ones. Roughly speaking, Grapheme is ~ std::string of C++. There is no way to make a range over that foolproof, as it may point into the stack.

There is basically no defined stance on small string optimization, value-typed small containers in general and no defined notion of invalidation of ranges over containers (when the latter go out of scope).

> Furthermore, I'm also showing concerns about Grapheme having non-GC allocation, and a destructor at all, which basically means it is impossible to place it in a dynamic array, without massive leaks. I'm pretty sure one of the first things I'd do is a quick auto myGraphemes = myString.byGrapheme().array().

I can hardly say anything other than that it's about bloody time to fix this in the compiler/druntime. Even knowing about it, I tend to do things like arrays of RefCounted!T and whatnot, and they must call dtors on finalize. If it's of any comfort, only truly long graphemes (>5 on 32 bits) are malloced.

jacob-carlborg reviewed Nov 29, 2013
Add std.uni.byGrapheme and std.uni.byCodePoint

31a4357

monarchdodra added a commit that referenced this pull request Dec 5, 2013

Merge pull request #1736 from JakobOvrum/graphemerange

43901ec

Add std.uni.byGrapheme and std.uni.byCodePoint

monarchdodra merged commit 43901ec into dlang:master Dec 5, 2013

JakobOvrum deleted the graphemerange branch December 22, 2013 09:39
