Conversation

MartinNowak
Member

  • Use fast path tests for non-complex unicode sequences that can be
    inlined. These rely on the built-in array bounds check.
  • Factor out complex cases into separate functions that do exception
    based validity checks. The char[] and wchar[] versions use
    pointers to avoid redundant array bounds checks, thus they can only
    be trusted.
  • Complete rewrite of decode for char[] to use less branching and
    unrolled loops. This requires fewer registers AND fewer instructions.
    The overlong check is done much cheaper on the code point.
  • The decode functions were made templates to short circuit the very
    restricted function inlining possibilities.

As a rough number, I get a 2x-4x speedup in streamed string decoding,
even for unicode heavy input.

We should think of moving these functions to druntime
to avoid the duplication.
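The fast-path/slow-path split described above can be sketched roughly like this. This is illustrative only, not the actual patch: the real pull handles the full 2-4 byte range with unrolled loops, while this toy out-of-line handler only decodes 2-byte sequences. It does show where the built-in bounds check, the inlinable ASCII fast path, and the code-point-based overlong check fit.

```d
import std.utf : UTFException;

// Sketch of the split: tiny inlinable fast path, complex cases factored out.
dchar decodeSketch(S)(in S str, ref size_t index) @trusted
    if (is(S : const(char[])))
{
    immutable c = str[index]; // built-in array bounds check covers this access
    if (c < 0x80)             // ASCII fast path, small enough to inline
    {
        ++index;
        return c;
    }
    return decodeComplex(str, index); // out-of-line, exception-based checks
}

// Toy slow path: 2-byte sequences only; the real patch covers 2-4 bytes.
private dchar decodeComplex(in char[] str, ref size_t index)
{
    immutable c = str[index];
    if ((c & 0xE0) == 0xC0 && index + 1 < str.length
        && (str[index + 1] & 0xC0) == 0x80)
    {
        immutable d = ((c & 0x1F) << 6) | (str[index + 1] & 0x3F);
        if (d < 0x80) // overlong check done cheaply on the decoded code point
            throw new UTFException("overlong UTF-8 sequence");
        index += 2;
        return d;
    }
    throw new UTFException("invalid or unsupported UTF-8 sequence");
}
```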

dchar decode(R:const(char[]))(R str, ref size_t index) @trusted pure
in
{
    assert(index < str.length);
}
Member


It's a nitpick, but some kind of message should be here, right?
Basically around the same as in std.range/std.array.

@DmitryOlshansky
Member

Nice job here. Other than that tiny assert-messages nitpick, it looks great.
And 2-4x performance, just wow :)

@ghost

ghost commented Oct 26, 2011

It would be orders of magnitude faster if you got rid of those exceptions. It's nice being able to continue decoding and replace invalid code points with something useful on the screen (e.g. Scintilla shows hex bytes for invalid code points). But with exceptions this function is too slow to be useful. In my own implementation the decode function takes an extra ref bool valid parameter, which I can check in a loop and then insert an "invalid character mark" and continue decoding. For large text files the difference in speed is colossal.

@MartinNowak
Member Author

I don't get your point.
Are you complaining that decoding tons of invalid code points is slow?

Exceptions are no-ops in the no-error case, just a bunch of conditional jumps,
which you also need for a boolean error flag anyway.
Also you would need to manually resync the UTF stream to the next valid start byte,
and it remains questionable whether one should continue decoding a corrupt stream at all.
Where is your decode function?
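For reference, the resync mentioned here is cheap in UTF-8, because continuation bytes are self-identifying by their top two bits. A minimal helper (illustrative, not part of the pull):

```d
// Skip the offending byte and any continuation bytes (0b10xxxxxx)
// until the next plausible start byte.
void resync(in char[] s, ref size_t i) @safe pure nothrow
{
    ++i;
    while (i < s.length && (s[i] & 0xC0) == 0x80)
        ++i;
}
```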

@ghost

ghost commented Oct 26, 2011

You're right: my numbers involve decoding lots of invalid code points. I was testing reading this file:
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

With exceptions I need around 150 msecs to parse the file and skip invalid code points with the old decode function. With my own decode function the average time is 2 msecs. The difference between the two functions is that my decode's header is dchar mydecode(in char[] s, ref size_t idx, ref bool valid); the bool is set to false by a goto:

  Lerr:
    valid = false;
    return 0;

If I use decode from this pull request the time jumps to ~350 msecs for that UTF test file. Maybe the real issue is D's slow exceptions. Anyway, that's testing a file with a lot of invalid sequences, so all the exceptions being thrown add to the time.
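A sketch of the decode loop that the valid-flag variant enables. The mydecode stand-in here is ASCII-only and purely illustrative (the commenter's real function handles full UTF-8), and the step-one-byte resync policy is an assumption:

```d
// Stand-in for the commenter's mydecode: ASCII-only, anything else is "invalid".
dchar mydecode(in char[] s, ref size_t idx, ref bool valid)
{
    immutable c = s[idx];
    if (c < 0x80) { ++idx; return c; }
    valid = false;
    return 0;
}

// Decode everything, marking invalid sequences instead of throwing.
dchar[] decodeAll(in char[] s)
{
    size_t i;
    dchar[] result;
    while (i < s.length)
    {
        bool valid = true;
        immutable c = mydecode(s, i, valid);
        if (valid)
            result ~= c;
        else
        {
            result ~= '\uFFFD'; // replacement mark, as Scintilla-style viewers do
            ++i;                // step past the bad byte and continue
        }
    }
    return result;
}
```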

A different issue I have is that decode seems to add a constant factor of 30msecs to loading a valid utf file which never triggers any exceptions. I don't know what could be causing this. decode from this pull adds 20 msecs instead.

Anyway, I'll try to make a small platform-independent test-case later so I can confirm my findings, I don't want to block this pull since it could very well be my fault.

@MartinNowak
Member Author

I've checked all 2^8 1-char, 2^16 2-char, 2^24 3-char and 2^32 4-char inputs against
the old implementation; they even throw the same exception messages.
Furthermore I checked decoding of some million random char, wchar and dchar arrays
with the same result.

The only change, which is done by design, is that the enforcement of index < str.length is replaced
with an assertion. The assertion is in the templated code, so it will react to the user's build configuration.

Overall I'm pretty confident that this introduces no regression.
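The exhaustive cross-check described above can be sketched like so. Here oldDecode is a hypothetical alias for the pre-patch implementation, and only the 2-byte loop is shown; the 1-, 3- and 4-byte loops are analogous:

```d
import std.utf : decode, UTFException;

// Cross-check the new decode against the old one for every 2-byte input,
// comparing decoded value, advanced index, and exception message.
void crossCheck2(alias oldDecode)()
{
    foreach (uint u; 0 .. 0x1_0000)
    {
        char[2] buf = [cast(char)(u >> 8), cast(char)u];
        size_t i1, i2;
        dchar d1, d2;
        string m1, m2;
        try d1 = decode(buf[], i1); catch (UTFException e) m1 = e.msg;
        try d2 = oldDecode(buf[], i2); catch (UTFException e) m2 = e.msg;
        assert(d1 == d2 && i1 == i2 && m1 == m2);
    }
}
```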

@CyberShadow
Member

These rely on the built-in array bounds check.

I didn't look too closely at the code, so this question might not make any sense at all: what happens if Phobos is rebuilt with -release (or the respective function is turned into a template and is built as part of a program with -release)? Is there a chance of segfaults / data corruption / security vulnerabilities?

@MartinNowak
Member Author

It now behaves as .front and .popFront do for arrays:

no flags                : AssertError with message
-release                : RangeError
-release -noboundscheck : undefined behavior

Note: This only applies to the element at index.
Every further decoding is explicitly enforced, because there
are no means to check it beforehand.

@CyberShadow
Member

Strings are present in most forms of user input, so I think we should be extra-careful here. It's not unimaginable that a programmer could disable bounds checking as an overzealous optimization.

When you say "undefined behavior", what's the worst that could happen (going from no-side-effects to remote code execution)?

@CyberShadow
Member

Another issue: Do I understand correctly that the changes make the code throw Error-derived classes?

Error-derived classes indicate unrecoverable errors. Validation failure of an invalid UTF-8 string should not be an unrecoverable error. Catching errors and throwing exceptions seems nefarious as well...

@MartinNowak
Member Author

No, they are still UTFExceptions.
To summarize again, the function behaves exactly as the old one. Especially in the presence of corrupt utf sequences or too short input.

The only difference is accessing the first code unit with index >= str.length.
This will now behave the same as an unchecked array .front access, that is,
you will get an AssertError, a RangeError, or unchecked memory access,
depending on which checks are disabled.

auto str = "Hello";
size_t index = 20;
std.utf.decode(str, index);

@CyberShadow
Member

OK, thanks for the clarification :)

@@ -549,92 +550,103 @@ size_t toUTFindex(in dchar[] str, size_t n) @safe pure nothrow
$(D UTFException) if $(D str[index]) is not the start of a valid UTF
sequence.
+/
dchar decode(in char[] str, ref size_t index) @safe pure
dchar decode(R:const(char[]))(R str, ref size_t index) @trusted pure
Member


I'd argue that this should be something like

dchar decode(C)(C[] str, ref size_t index) @trusted pure
    if(is(Unqual!C == char))

It's far more typical in Phobos to use template constraints. I know that Andrei considers that to generally be a better choice, and I agree with him.

Member Author


D'accord.

It needs to test for implicit conversion to what
previously was the overloaded parameter or would break code.

dchar decode(S)(S str, ref size_t index) @trusted pure
    if(is(S : const(char[])))

Member


No, what I suggested works just fine. C[], where is(Unqual!C == char) is true, will work with any array of char. You can pass const char[] to const(char)[] just fine, because the original array is unaltered (since it isn't passed into the function - a slice of it is) and the elements are still const.

Member Author


Unfortunately it does not, which is why I went with specialization in the first place.
For string enums the array element deduction will fail.

enum XYZ : string { a = "foo" };
size_t i;
decode(XYZ.a, i);

XYZ can't be matched with C[].

Probably the compiler could do better here, but currently it would add a breaking change.

Member


Hmmm. Well, then it should probably do what some of std.string does, which would be something like

dchar decode(C)(const(C)[] str, ref size_t index) @trusted pure
    if(is(Unqual!C == char))

And actually, it arguably should be doing something like that anyway, since in general, it's better to make parameters const when they can be. With in, it was const, but with your changes, it's not. It works with both const and non-const, but it's not const for both. And even if that doesn't work for some reason, it's probably still better to add const to what you have.

Member Author


That won't work either, the first parameter is of type XYZ. The compiler can't deduce an array element type
from that. I've added a qualifier though.

dchar decode(S)(in S str, ref size_t index) @trusted pure
    if(is(S : const(char[])))

Member


If it can't, then it should probably be reported as a bug, but what you have works, and if the compiler won't let it work the other way, then it won't let it work the other way - bug or not. So, it's fine as-is then.

@jmdavis
Member

jmdavis commented Nov 7, 2011

You're going to need to rebase this before it gets committed. It's fine as it is for review purposes, but there are too many small commits for it to be merged in as-is.

@braddr
Member

braddr commented Nov 7, 2011

jmdavis: I strongly disagree. I much prefer to see small commits that are trivial to understand and trivial to review.

@jmdavis
Member

jmdavis commented Nov 7, 2011

I'm not saying that all pull requests should be a single commit or anything like that. I'm saying that you shouldn't have a bunch of commits that change only a few lines - especially when it's only one line. Also, it's not infrequent that there are suggestions in reviews which result in changes that would be cleaner if the commits were rearranged so that the change is in the original commit. Take the "rename exception factory" commit for instance. If you change the original commit which introduced that function so that it was named exception instead of error, it's just as clear, and you have one fewer commit.

Commits should be broken up logically such that it's easier to understand which changes are being made - huge commits with tons of changes are problematic - but having a ton of small commits which could be merged together to result in cleaner commits which are just as understandable - if not more so - is not a good idea IMHO. In such cases, it's better to merge those commits.

@jmdavis
Member

jmdavis commented Nov 7, 2011

As far as I can see, this is fine now, but I'm not going to merge it in at the moment, because the Phobos unit tests aren't currently building due to a recent commit, and I don't like to merge stuff in when we can't verify that it's not breaking something.

jmdavis added a commit that referenced this pull request Nov 7, 2011
UTF decoding optimizations
@jmdavis jmdavis merged commit f4616f8 into dlang:master Nov 7, 2011
@jmdavis
Member

jmdavis commented Nov 7, 2011

Merged.

marler8997 pushed a commit to marler8997/phobos that referenced this pull request Nov 10, 2019
Improve logging accuracy
merged-on-behalf-of: Sebastian Wilzbach <sebi.wilzbach@gmail.com>