add byCodeunit, byChar, byWchar and byDchar to std.utf #2043

WalterBright · 2014-03-23T22:19:50Z

These should form the foundation for higher performance string processing, as they:

do not auto-decode
are pure/nothrow/safe, depending on the input parameter
are lazy
do not allocate memory
replace invalid UTF sequences with U+FFFD, per the Unicode Standard 6.2, rather than throwing or asserting
are composable - completely compatible with ranges and algorithms
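A minimal sketch of the properties listed above, assuming a current Phobos where the functions are spelled `byCodeUnit` and `byDchar`. The helper names `countCodeUnits` and `decodeLossy` are illustrative only and are not part of the pull request.

```d
import std.algorithm.comparison : equal;
import std.utf : byCodeUnit, byDchar;

/// Counts UTF-8 code units without decoding (illustrative helper).
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (c; s.byCodeUnit) // c is a char here, not an auto-decoded dchar
        ++n;
    return n;
}

/// Lazily decodes to code points; nothing is allocated and nothing is thrown.
auto decodeLossy(const(char)[] s)
{
    return s.byDchar;
}

unittest
{
    // 'h' is one code unit, U+00E9 is two code units in UTF-8.
    assert(countCodeUnits("h\u00E9") == 3);

    // A stray continuation byte (0x80) is invalid UTF-8; it is replaced
    // with U+FFFD rather than raising an exception.
    const(char)[] bad = ['a', cast(char) 0x80, 'b'];
    assert(decodeLossy(bad).equal("a\uFFFDb"));
}
```

Because both helpers return or iterate lazy ranges, they compose with the usual algorithms such as `equal`, `map`, or `filter`.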

Unit tests are at 100%.

ghost · 2014-03-23T22:30:07Z

std/utf.d

+
+auto byCodeunit(R)(R r) if (isSomeString!R)
+{
+    alias tchar = Unqual!(ElementEncodingType!R);


This isn't used in this function.

bearophile · 2014-03-23T22:49:02Z

If I use "SomeText".byCodeUnit.map!(c => c) what is the type of the given output items? Is it a dchar again?

WalterBright · 2014-03-23T23:10:09Z

@bearophile it's char
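The element type can be checked directly. The sketch below assumes a current Phobos with auto-decoding enabled for plain strings; the two helper functions are hypothetical and exist only to expose the type names.

```d
import std.algorithm.iteration : map;
import std.range.primitives : ElementType;
import std.traits : Unqual;
import std.utf : byCodeUnit;

/// Unqualified element type that `map` sees after byCodeUnit (illustrative helper).
string mappedElementType()
{
    auto mapped = "SomeText".byCodeUnit.map!(c => c);
    return Unqual!(ElementType!(typeof(mapped))).stringof;
}

/// Same pipeline on the bare string, where auto-decoding applies.
string autoDecodedElementType()
{
    auto mapped = "SomeText".map!(c => c);
    return Unqual!(ElementType!(typeof(mapped))).stringof;
}

unittest
{
    assert(mappedElementType() == "char");
    assert(autoDecodedElementType() == "dchar");
}
```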

dnadlinger · 2014-03-23T23:18:12Z

std/utf.d

+/********************************************
+ * Iterate a range of char, wchar, or dchars by code unit.
+ * The purpose is to bypass the special case decoding that
+ * std.array.put does to character arrays.


Only std.array.put, not front? And as far as I can remember, put doesn't actually special-case character arrays, it just offers transcoding support for character ranges in general.

It doesn't, but could. As a matter of fact, it's something I'd like to change. Once I get the time to.

I meant front().

jacob-carlborg · 2014-03-24T13:51:10Z

None of the Ddoc comments use any of the standard sections, i.e. Params, Returns and so on.

jacob-carlborg · 2014-03-24T13:53:46Z

std/utf.d

+        @property bool empty() { return r.length == 0; }
+        @property auto front() { return r[0]; }
+        void popFront()        { r = r[1 .. $]; }
+        auto opIndex(size_t index) const { return r[index]; }


I'm not so sure I like the style of these methods, putting everything on a single line.

I prefer it for trivial functions. Otherwise things tend to get spread out, making it harder to digest at a glance.

ditto here about inout and ref.

jacob-carlborg · 2014-03-24T14:00:30Z

I would like to see some descriptions/annotations for the unit tests.

DmitryOlshansky · 2014-03-24T16:14:22Z

@WalterBright see also pull #2038, and add a test case for bad UTF-8 with values above 0x10_FFFF.

WalterBright · 2014-03-24T18:13:06Z

This PR addresses https://d.puremagic.com/issues/show_bug.cgi?id=12113

DmitryOlshansky · 2014-03-24T18:17:26Z

> This PR addresses https://d.puremagic.com/issues/show_bug.cgi?id=12113

Point being that this pull has the same bug in UTF-8 decoding. I could make a separate bug report for these new primitives, but it's kind of meh.

WalterBright · 2014-03-24T20:11:59Z

> Point being that this pull has the same bug in UTF-8 decoding.

The same bug? What am I missing?

WalterBright · 2014-03-24T20:14:06Z

Updated to address comments.

DmitryOlshansky · 2014-03-24T20:24:39Z

std/utf.d

+                            }
+
+                            // check for out of range only needed for 4 bytes
+                            static if (i == 3)


@WalterBright
I meant this check. It seems that either you've fixed it or I missed that it was there in the first place.
Anyway, all good.

I had missed it until you pointed it out. It's fixed now, with a unittest.

WalterBright · 2014-03-26T03:12:55Z

@blackwhale and @monarchdodra, you are both right, @andralex I misled you. I was using the wrong test case. Will change it back.

WalterBright · 2014-03-26T04:36:29Z

@monarchdodra no, U+FFFF is not checked for. It's a bit in limbo, being reserved for the application.

monarchdodra · 2014-03-26T09:15:59Z

> @monarchdodra no, U+FFFF is not checked for. It's a bit in limbo, being reserved for the application.

I don't think it'll happen very often, but it could be trivially fixed by setting frontChar's .init value to an actually illegal and checked for value, such as U+D800.

WalterBright · 2014-03-26T18:56:16Z

@monarchdodra I did a fair amount of thinking about that. We've used U+FFFF as a 'nan' value from the beginning without any reported issues with it. U+FFFF will remain in the transcoded UTF stream as U+FFFF, it won't get lost or missed. U+FFFF is not a transcoding error. U+D800 is not an "actually illegal" value according to the spec, and I think it would be a mistake to mix it up with U+FFFF. Etc.

monarchdodra · 2014-03-26T21:13:04Z

> @monarchdodra I did a fair amount of thinking about that. We've used U+FFFF as a 'nan' value from the beginning without any reported issues with it. U+FFFF will remain in the transcoded UTF stream as U+FFFF, it won't get lost or missed. U+FFFF is not a transcoding error. U+D800 is not an "actually illegal" value according to the spec, and I think it would be a mistake to mix it up with U+FFFF. Etc.

You have misunderstood me, and I agree with you. I'm not suggesting we replace U+FFFF or anything. I agree it is not an 'illegal' value, and should remain in the stream. I'm just pointing out that you are also using the value U+FFFF (dchar.init) as a flag to mean "front not yet evaluated". As such, if there does happen to be a U+FFFF in your stream, calling front more than once in a row will pop it. In essence, you'd get two different values for two consecutive calls to front.

I'm instead suggesting you choose the value U+D800 as the "not yet evaluated" value: If there are any U+D800 values in your stream, they are flagged as "invalid", and replaced by U+FFFD. In essence, this makes it impossible to "have" U+D800 as a front, avoiding the issue above.

It's just an implementation tweak, not a change of behavior/handling for U+FFFF.
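The hazard described above can be reproduced with a small model. This is a hypothetical simplification and not the code from the pull request: `SentinelCache` and `FlagCache` are invented names, and the source is a plain `dchar` slice so that no decoding is involved. The second struct uses a separate boolean instead of a sentinel, which is one way to sidestep the collision; the U+D800 sentinel suggested above is another.

```d
/// Uses dchar.init (U+FFFF) as the "front not yet evaluated" marker.
struct SentinelCache
{
    const(dchar)[] src;
    dchar cached = dchar.init; // U+FFFF doubles as "nothing cached"

    @property bool empty() const
    {
        return cached == dchar.init && src.length == 0;
    }

    @property dchar front()
    {
        if (cached == dchar.init) // also true after caching a real U+FFFF
        {
            cached = src[0];
            src = src[1 .. $];
        }
        return cached;
    }

    void popFront()
    {
        if (cached == dchar.init)
            src = src[1 .. $]; // front was never evaluated: skip one element
        cached = dchar.init;
    }
}

/// Keeps the "evaluated" state in a flag, so every dchar value is a legal element.
struct FlagCache
{
    const(dchar)[] src;
    dchar cached;
    bool haveData;

    @property bool empty() const { return !haveData && src.length == 0; }

    @property dchar front()
    {
        if (!haveData)
        {
            cached = src[0];
            src = src[1 .. $];
            haveData = true;
        }
        return cached;
    }

    void popFront()
    {
        if (!haveData)
            src = src[1 .. $];
        haveData = false;
    }
}

unittest
{
    const(dchar)[] data = [cast(dchar) 0xFFFF, 'a'];

    auto s = SentinelCache(data);
    assert(s.front == 0xFFFF);
    assert(s.front == 'a'); // second call silently moved on

    auto f = FlagCache(data);
    assert(f.front == 0xFFFF);
    assert(f.front == 0xFFFF); // stable across repeated calls
}
```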

WalterBright · 2014-05-13T04:25:55Z

Updated so that:

empty needn't be called
front can be called repeatedly

monarchdodra · 2014-05-17T21:26:39Z

std/utf.d

+    static struct ByCodeUnitImpl
+    {
+        @property bool empty() const         { return r.length == 0; }
+        @property auto ref front() const     { return r[0]; }


This (and opIndex) should be inout, or you won't actually be able to mutate the returned reference:

auto s = "hello".dup.byCodeUnit(); s.front = 'H';

Unless that was done on purpose?
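The difference between the two qualifiers can be shown with a pair of stand-in structs. These are hypothetical and much smaller than ByCodeUnitImpl; only the signature of `front` matters here.

```d
/// front is const: the returned reference is to const(char), so it is read-only.
struct ConstFront
{
    char[] r;
    @property auto ref front() const { return r[0]; }
}

/// front is inout: mutability of the result follows the mutability of the range.
struct InoutFront
{
    char[] r;
    @property ref inout(char) front() inout { return r[0]; }
}

unittest
{
    // Assigning through a const front does not compile.
    static assert(!__traits(compiles, { ConstFront v; v.front = 'H'; }));

    auto v = InoutFront("hello".dup);
    v.front = 'H'; // writes through to the underlying array
    assert(v.r == "Hello");

    // The same method still works on a const range and yields const(char).
    const c = InoutFront("hi".dup);
    static assert(is(typeof(c.front) == const(char)));
}
```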

monarchdodra · 2014-05-17T21:29:37Z

I gave one final review to the entire pull. I like the new implementations. It looks good to ship as is.

I left some nitpicks; it would be nice if you could address them, but they aren't "blocker" issues.

John-Colvin · 2014-05-21T22:16:28Z

Doesn't this fail miserably with std.range.popFrontN ??

I'm not convinced that I like the whole design here*, but at the very least you need to have an accumulator for how many elements front should remove when called.

*Take a range of char r1 that blocks on popFront waiting for data from the network or an expensive calculation, with its current front cached.

auto r2 = r1.byDchar;
if(!r2.empty)
{
    auto e = r2.front; //Whoops, access to an already known value is blocked
                       //waiting on getting the *next* data point.
}

monarchdodra · 2014-05-22T08:54:38Z

> Doesn't this fail miserably with std.range.popFrontN ??

How so? It's no worse than with a string...?

> I'm not convinced that I like the whole design here*, but at the very least you need to have an accumulator for how many elements front should remove when called.

I don't see how byDchar is really any different from any other input range adaptor, such as filter, splitter or whatever. Simply put, yes, there is not a 1:1 mapping in terms of elements popped, but I don't see that as a problem.

> auto r2 = r1.byDchar;
> if(!r2.empty)
> {
>     auto e = r2.front; //Whoops, access to an already known value is blocked
>                        //waiting on getting the *next* data point.
> }

Again, I don't see how that's different from any other range adaptor. Besides, this could also block at if(!r2.empty). At the end of the day, if the range/stream halts waiting for data, then a wait will occur. !empty != data is in buffer.

John-Colvin · 2014-05-22T09:37:32Z

auto s = "fdsa";
s.popFrontN(2);
assert(equal(s, "sa"));
s = "fdsa";
auto sByDchar = s.byDchar();
sByDchar.popFrontN(2);
assert(equal(sByDchar, "sa")); //fails, only the 'f' has been popped.

John-Colvin · 2014-05-22T09:41:58Z

I think we still have a serious problem with the semantic specification of ranges and their primitives.

monarchdodra · 2014-05-22T09:44:41Z

> //fails, only the 'f' has been popped.

Indeed, I missed that. There's an implementation problem here.

> I think we still have a serious problem with the semantic specification of ranges.

Just the implementation I think.

monarchdodra · 2014-05-22T09:46:23Z

BTW, I hadn't suggested it yet, but we might want to split this pull between byCodeUnit and byXChar. I think byCodeUnit is much more important, and trivial, and virtually ready to go.

John-Colvin · 2014-05-22T09:47:53Z

> BTW, I hadn't suggested it yet, but we might want to split this pull between byCodeUnit and byXChar. I think byCodeUnit is much more important, and trivial, and virtually ready to go.

I agree. My problems are with the by*char ranges, byCodeUnit looks fine.

WalterBright · 2014-05-27T06:40:28Z

Addressed all the issues raised here.

monarchdodra · 2014-05-29T10:27:05Z

Anybody see any outstanding (non-style related) issues left? @John-Colvin maybe? Will soon merge otherwise.

John-Colvin · 2014-05-29T16:15:13Z

I still think a range wrapper like byDchar shouldn't advance its inner range by more than is necessary to calculate the current front. Such wrappers should be maximally lazy w.r.t. the amount of data they request, to avoid unnecessary latency.

Having said that, I see no problem merging this how it is; it can always be changed later (becoming more lazy here isn't a breaking change unless someone is relying on undocumented implementation details).

monarchdodra · 2014-05-29T17:23:03Z

> I still think a range wrapper like byDchar shouldn't advance its inner range by more than is necessary to calculate the current front.

Isn't that what it's doing though? I see most of the work being done in front? popFront only does any work if it's called and front has not yet been evaluated...

John-Colvin · 2014-05-29T20:33:01Z

Talking specifically about byDchar for a range of char: calling front causes the inner range to be popped such that its front points to the first char not in the current code point. For an inner range that updates its state in popFront, you are requesting more information from the original data source than is currently needed.

Sometimes this wouldn't be a problem, but anything that handles user input events in real time can't afford to be blocked waiting for the next input in order to access the previous one.

Once a wrapper uses eager popping, it requires a redesign of the inner range (or another intermediate range that defers the work in popFront to front) to restore the fully lazy behaviour and avoid latency problems. It's inelegant and leads to a rat's nest of guard variables that confuse the optimiser and increase register pressure.

WalterBright · 2014-05-30T00:16:18Z

@John-Colvin popFront() does not have the behavior you describe - it does not wait for more input. It 'consumes' the current input.

DmitryOlshansky · 2014-05-30T06:00:44Z

I kind of like it but turns out that I, for instance, can't use anything of this in std.regex.

Reasons:

no tail-const as in slices, so I can't forward byCodeUnit!(string) to byCodeUnit!(const(char)[])
decoding is baked into a range - can't mix and match iterating by say char and dchar on the same string (though I'm getting rid of that in std.regex soon enough)
anyhow C's run-time memchr still needs to peek at naked character (or rather byte[]) arrays

All of the above makes it next to useless in my use cases, and possibly for other parser/lexer code as well.

monarchdodra · 2014-06-04T14:44:56Z

> @John-Colvin popFront() does not have the behavior you describe - it does not wait for more input. It 'consumes' the current input.

I think he means that front calls popFront() at the end of its execution, and then popFront() itself does nothing. So arguably, the underlying range is popped just a little bit eagerly.

I'm not sure that's a problem though.

John-Colvin · 2014-06-04T16:52:37Z

@WalterBright You misunderstand me. I am referring to byDchar.front calling r.popFront more eagerly than is needed.

WalterBright · 2014-06-06T01:20:01Z

@John-Colvin ok, done.

monarchdodra · 2014-06-07T12:41:20Z

Auto-merge toggled on

monarchdodra · 2014-06-07T12:42:05Z

At this point, I think this is good enough for inclusion. Anything left we can fix later.

add byCodeunit, byChar, byWchar and byDchar to std.utf

WalterBright · 2014-06-07T18:07:35Z

thanks!

monarchdodra · 2014-06-16T14:57:16Z

std/utf.d

+            {
+                if (!haveData)
+                    front;
+                r.popFront();


Hum... looks like we were a little fast with the whole "lazy pop thing":
http://forum.dlang.org/thread/trqnqtzspoyhggvvftgp@forum.dlang.org

Long story short, if the last character in the string is actually truncated, then the popFront() of the underlying range will fail, since it will already be empty.

A "fix" would be to add:

if (!r.empty) r.popFront();

But that check could be more costly than what we are trying to save on? It needs to be fixed one way or the other. @WalterBright ?

I didn't check if byWchar is also subject to the issue.
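The failure mode and one way around it can be sketched with a toy decoder. Everything here is an assumption made for illustration: `LazyDecoder`, `lazyDecode` and the `pendingPop` flag are invented names, the decoder handles only 1- and 2-byte UTF-8 sequences (no overlong checks), and it is not the Phobos code. Instead of testing `r.empty` before every pop, it records whether the last code unit of the current sequence is still sitting at the front of the inner range.

```d
import std.algorithm.comparison : equal;
import std.range.primitives : ElementType, empty, front, isInputRange, popFront;

/// Toy lazy decoder over a range of UTF-8 code units (1- and 2-byte forms only).
struct LazyDecoder(R)
if (isInputRange!R && is(ElementType!R : ubyte))
{
    R r;
    dchar cached;
    bool haveData;   // true once `cached` holds the current code point
    bool pendingPop; // true if r.front is still the last unit of that code point

    @property bool empty() { return !haveData && r.empty; }

    @property dchar front()
    {
        if (!haveData)
        {
            immutable uint lead = r.front;
            if (lead < 0x80)
            {
                cached = cast(dchar) lead; // leave the unit in r until popFront
                pendingPop = true;
            }
            else
            {
                r.popFront(); // must advance to look at the continuation unit
                if ((lead & 0xE0) == 0xC0 && !r.empty && (r.front & 0xC0) == 0x80)
                {
                    cached = cast(dchar) (((lead & 0x1F) << 6) | (r.front & 0x3F));
                    pendingPop = true;
                }
                else
                {
                    // Truncated or malformed: r is either empty or already
                    // positioned at the next sequence, so nothing is left to pop.
                    cached = '\uFFFD';
                    pendingPop = false;
                }
            }
            haveData = true;
        }
        return cached;
    }

    void popFront()
    {
        if (!haveData)
            cast(void) front; // evaluate front so pendingPop is up to date
        if (pendingPop)
            r.popFront();
        haveData = false;
    }
}

/// Convenience constructor (illustrative helper).
auto lazyDecode(R)(R r) { return LazyDecoder!R(r); }

unittest
{
    const(ubyte)[] truncated = [0x61, 0xC3]; // 'a' followed by a cut-off sequence
    assert(lazyDecode(truncated).equal("a\uFFFD"));
}
```

The flag costs one extra byte of state but avoids an `empty` call on the inner range for every pop, which is the cost the comment above is worried about. Whether that trade-off is worthwhile for the real ranges would need measuring.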

Since this has been merged, please file bugzilla issues for problems with it.

sinkuu · 2014-09-26T14:10:27Z

ByCodeUnitImpl.opSlice returns a slice of the original array. Shouldn't it return the same type as the range itself (as hasSlicing requires)?

Edit: filed a bug https://issues.dlang.org/show_bug.cgi?id=13535
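The requirement can be demonstrated with two stand-in ranges that differ only in the return type of opSlice. Both structs are hypothetical and are not ByCodeUnitImpl; the check relies on `std.range.primitives.hasSlicing` as found in a current Phobos.

```d
import std.range.primitives : hasSlicing, isForwardRange;

/// opSlice hands back the underlying array, so slices lose the wrapper type.
struct RawSlice
{
    char[] r;
    @property bool empty() const { return r.length == 0; }
    @property char front() const { return r[0]; }
    void popFront() { r = r[1 .. $]; }
    @property RawSlice save() { return this; }
    @property size_t length() const { return r.length; }
    char[] opSlice(size_t lo, size_t hi) { return r[lo .. hi]; }
}

/// opSlice returns the range type itself, as hasSlicing expects.
struct SelfSlice
{
    char[] r;
    @property bool empty() const { return r.length == 0; }
    @property char front() const { return r[0]; }
    void popFront() { r = r[1 .. $]; }
    @property SelfSlice save() { return this; }
    @property size_t length() const { return r.length; }
    SelfSlice opSlice(size_t lo, size_t hi) { return SelfSlice(r[lo .. hi]); }
}

unittest
{
    static assert(isForwardRange!RawSlice && !hasSlicing!RawSlice);
    static assert(hasSlicing!SelfSlice);

    auto s = SelfSlice("hello".dup);
    assert(s[1 .. 3].r == "el"); // the slice is still a SelfSlice
}
```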

quickfur · 2014-09-26T14:43:08Z

Please file a bug for this, since this PR is already merged.

add byCodeunit, byChar, byWchar and byDchar to std.utf

0cfc022

monarchdodra added a commit that referenced this pull request Jun 7, 2014

Merge pull request #2043 from WalterBright/byChar

5319539

add byCodeunit, byChar, byWchar and byDchar to std.utf

monarchdodra merged commit 5319539 into dlang:master Jun 7, 2014

WalterBright deleted the byChar branch June 7, 2014 18:04
