-
-
Notifications
You must be signed in to change notification settings - Fork 740
Add std.uni.byGrapheme and std.uni.byCodePoint #1736
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
||
void popFront() | ||
{ | ||
_front = _range.empty? Grapheme.init : _range.decodeGrapheme(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't _range.empty?
be an assert
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 just do assert(!empty) and do the decoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait a sec, I see that you use Grapheme.init as empty flag.
Personally I'd say use the boolean flag as it should be faster even if it makes byGrapheme
range a bit bulkier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, Grapheme.init
is the empty flag. The idea is to formulate it so that front
and empty
are as cheap as possible, with the work done in popFront
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
empty
is now implemented with _front.length == 0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space (
) before the?
operators plz :)
@blackwhale, any word on grapheme decoding from the back of strings, ala |
The word is that decodeGrapheme should work fine with retro(range). |
@blackwhale, the problem is that |
I'm not seeing that problem: if Range _range is bidirectional, then auto r = retro(_range); Should work. That being said I think we'd have to pay extra cost of the second grapheme slot. Not pretty. It would be better if retro allowed ranges to provide their own 'get-a-reversed-range' call even if they are not bidirectional. |
@blackwhale, With reverse grapheme decoding, one would want the reverse of "noël" to be "lëon", e.g. through the following hypothetical example: writeln("noe\u0308l".byGrapheme.retro.byCodePoint); // Should print "lëon" in decomposed format This is a simple consequence of the smallest unit (the element type) being a grapheme. Compare that with using writeln("noe\u0308l".retro.byGrapheme.map!(g => g[].array).joiner); The above prints "l̈eon" - the combining diaeresis is attached to the "l" rather than the "e". I wouldn't expect |
Fixed forward-range propagation. |
@JakobOvrum Yeah, I suddenly forgot that unlike with forward decoding that typically starts with "starter" it'd have to first take combination marks and such and end on a "starter". That means I'd have to reverse the grapheme cluster automation. For now I suggest we simply postpone making it bidirectional to a separate pull. |
Yeah, it would be a completely additive change and I think |
Otherwise it's LGTM, paging @monarchdodra to merge as he sees fit. |
LGTM, mostly. I'll take a day or two to fully review it. I am looking at /++
Random-access range over Grapheme's $(CHARACTERS).
Warning: Invalidates when this Grapheme leaves the scope,
attempts to use it then would lead to memory corruption.
+/ Such design is really very dangerous, even in an Furthermore, I'm also showing concerns about Well, this is outside of the scope of this pull, but for something we "hoped" would be simple, the design of |
Ternary operator whitespace is now consistent with the rest of the module.
I think |
@monarchdodra TL;DR - let's think of making it @System. Shouldn't break much of code.
I do agree with general sentiment. However I had very little choice here - we ABSOLUTELY do not want it to allocate for SMALL graphemes. And these are like 99.99% of cases (around 1-3 codepoints). And this means stepping into an uncharted (for Phobos, sadly) territory of small string optimization and stack allocation for small graphemes + deterministic destruction in case of long. Roughly speaking Grapheme is ~ std::string of C++. There is no way to make a range over that foolproof as it may point into stack. There is basically no defined stance on small string optimization, value-typed small containers in general and no defined notion of invalidation of ranges over containers (when the latter go out of scope).
I hardly can say anything else then it's about bloody time to fix this in compiler/druntime. Because even knowing about it I tend to do things like arrays of RefCounted!T and whatnot and they must call dtors on finalize.If it's of any comfort only truly long graphems (>5 on 32bits) are malloced. |
Add std.uni.byGrapheme and std.uni.byCodePoint
Considered in response to the Unicode handling comparison thread on the NewsGroup.
byGrapheme
eases string manipulation on graphemes by allowing range algorithms to work with graphemes rather than code points.byCodePoint
is a counterpart necessary for converting the result of range-based string manipulation on graphemes back to a string. e.g.graphemes.byCodePoint.text
converts a range of graphemes to astring
, andgraphemes.byCodePoint.array
converts a range of graphemes to adchar[]
. Of course, it's also useful on its own when a range of code points is accepted, such as when doing I/O:writeln(graphemes.byCodePoint);
The code example uses
array
beforeretro
becausebyGrapheme
is only aForwardRange
. Bidirectionality is technically possible to add, butbyGrapheme
simply builds on existingstd.uni
functionality (i.e.decodeGrapheme
), which does not seem to support decoding graphemes from the back of a string.If bidirectionality is added to
byGrapheme
, thenbyCodePoint
should be edited to propagate it when available.@blackwhale (and others), please destroy.