Conversation

MartinNowak
Member

  • Use fast path tests for non-complex unicode sequences that can be
    inlined. These rely on the built-in array bounds check.
  • Factor out complex cases into separate functions that do exception
    based validity checks. The char[] and wchar[] versions use
    pointers to avoid redundant array bounds checks, thus they can only
    be trusted.
  • Complete rewrite of decode for char[] to use less branching and
    unrolled loops. This requires fewer registers AND fewer instructions.
    The overlong check is done much cheaper on the code point.
  • The decode functions were made templates to short circuit the very
    restricted function inlining possibilities.

As a rough number, I get a 2x-4x speedup in streamed string decoding,
even for unicode heavy input.

We should think of moving these functions to druntime
to avoid the duplication.
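The fast-path/slow-path split described above can be sketched roughly like this. This is illustrative only, not the actual patch: the real pull handles the full 2-4 byte range with unrolled loops, while this toy out-of-line handler only decodes 2-byte sequences. It does show where the built-in bounds check, the inlinable ASCII fast path, and the code-point-based overlong check fit.

```d
import std.utf : UTFException;

// Sketch of the split: tiny inlinable fast path, complex cases factored out.
dchar decodeSketch(S)(in S str, ref size_t index) @trusted
    if (is(S : const(char[])))
{
    immutable c = str[index]; // built-in array bounds check covers this access
    if (c < 0x80)             // ASCII fast path, small enough to inline
    {
        ++index;
        return c;
    }
    return decodeComplex(str, index); // out-of-line, exception-based checks
}

// Toy slow path: 2-byte sequences only; the real patch covers 2-4 bytes.
private dchar decodeComplex(in char[] str, ref size_t index)
{
    immutable c = str[index];
    if ((c & 0xE0) == 0xC0 && index + 1 < str.length
        && (str[index + 1] & 0xC0) == 0x80)
    {
        immutable d = ((c & 0x1F) << 6) | (str[index + 1] & 0x3F);
        if (d < 0x80) // overlong check done cheaply on the decoded code point
            throw new UTFException("overlong UTF-8 sequence");
        index += 2;
        return d;
    }
    throw new UTFException("invalid or unsupported UTF-8 sequence");
}
```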

dchar decode(R:const(char[]))(R str, ref size_t index) @trusted pure
in
{
    assert(index < str.length);
}
Member


It's a nitpick, but some kind of message should be here, right?
Basically around the same as in std.range/std.array.

@DmitryOlshansky
Member

Nice job here. Other than that tiny assert-messages nitpick, it looks great.
And 2-4x performance, just wow :)

@ghost

ghost commented Oct 26, 2011

It would be orders of magnitude faster if you got rid of those exceptions. It's nice being able to continue decoding and replace invalid code points with something useful on the screen (e.g. Scintilla shows hex bytes for invalid code points). But with exceptions this function is too slow to be useful. In my own implementation the decode function takes an extra ref bool valid parameter, which I can check in a loop and then insert an "invalid character mark" and continue decoding. For large text files the difference in speed is colossal.

@MartinNowak
Member Author

I don't get your point.
Are you complaining that decoding tons of invalid code points is slow?

Exceptions are no-ops in the no-error case, just a bunch of conditional jumps,
which you also need for a boolean error flag anyway.
Also you would need to manually resync the UTF stream to the next valid start byte,
and it remains questionable whether one should continue decoding a corrupt stream at all.
Where is your decode function?
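For reference, the resync mentioned here is cheap in UTF-8, because continuation bytes are self-identifying by their top two bits. A minimal helper (illustrative, not part of the pull):

```d
// Skip the offending byte and any continuation bytes (0b10xxxxxx)
// until the next plausible start byte.
void resync(in char[] s, ref size_t i) @safe pure nothrow
{
    ++i;
    while (i < s.length && (s[i] & 0xC0) == 0x80)
        ++i;
}
```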

@ghost

ghost commented Oct 26, 2011

You're right: my numbers involve decoding lots of invalid code points. I was testing reading this file:
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

With exceptions I need around 150 msecs to parse the file and skip invalid code points with the old decode function. With my own decode function the average time is 2 msecs. The difference between the two functions is that my decode's header is dchar mydecode(in char[] s, ref size_t idx, ref bool valid); the bool is set to false by a goto:

  Lerr:
    valid = false;
    return 0;

If I use decode from this pull request the time jumps to ~350 msecs for that UTF test file. Maybe the real issue is D's slow exceptions. Anyway, that's testing a file with a lot of invalid sequences, so all the exceptions being thrown add to the time.
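A sketch of the decode loop that the valid-flag variant enables. The mydecode stand-in here is ASCII-only and purely illustrative (the commenter's real function handles full UTF-8), and the step-one-byte resync policy is an assumption:

```d
// Stand-in for the commenter's mydecode: ASCII-only, anything else is "invalid".
dchar mydecode(in char[] s, ref size_t idx, ref bool valid)
{
    immutable c = s[idx];
    if (c < 0x80) { ++idx; return c; }
    valid = false;
    return 0;
}

// Decode everything, marking invalid sequences instead of throwing.
dchar[] decodeAll(in char[] s)
{
    size_t i;
    dchar[] result;
    while (i < s.length)
    {
        bool valid = true;
        immutable c = mydecode(s, i, valid);
        if (valid)
            result ~= c;
        else
        {
            result ~= '\uFFFD'; // replacement mark, as Scintilla-style viewers do
            ++i;                // step past the bad byte and continue
        }
    }
    return result;
}
```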

A different issue I have is that decode seems to add a constant factor of 30msecs to loading a valid utf file which never triggers any exceptions. I don't know what could be causing this. decode from this pull adds 20 msecs instead.

Anyway, I'll try to make a small platform-independent test-case later so I can confirm my findings, I don't want to block this pull since it could very well be my fault.

@MartinNowak
Member Author

I've checked all 2^8 1-char, 2^16 2-char, 2^24 3-char and 2^32 4-char inputs against
the old implementation; they even throw the same exception messages.
Furthermore I checked decoding of some million random char, wchar and dchar arrays
with the same result.

The only change, which is done by design, is that the enforcement of index < str.length is replaced
with an assertion. The assertion is in the templated code, so it will react to the user's build configuration.

Overall I'm pretty confident that this introduces no regression.
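The exhaustive cross-check described above can be sketched like so. Here oldDecode is a hypothetical alias for the pre-patch implementation, and only the 2-byte loop is shown; the 1-, 3- and 4-byte loops are analogous:

```d
import std.utf : decode, UTFException;

// Cross-check the new decode against the old one for every 2-byte input,
// comparing decoded value, advanced index, and exception message.
void crossCheck2(alias oldDecode)()
{
    foreach (uint u; 0 .. 0x1_0000)
    {
        char[2] buf = [cast(char)(u >> 8), cast(char)u];
        size_t i1, i2;
        dchar d1, d2;
        string m1, m2;
        try d1 = decode(buf[], i1); catch (UTFException e) m1 = e.msg;
        try d2 = oldDecode(buf[], i2); catch (UTFException e) m2 = e.msg;
        assert(d1 == d2 && i1 == i2 && m1 == m2);
    }
}
```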

@CyberShadow
Member

These rely on the built-in array bounds check.

I didn't look too closely at the code, so this question might not make any sense at all: what happens if Phobos is rebuilt with -release (or the respective function is turned into a template and is built as part of a program with -release)? Is there a chance of segfaults / data corruption / security vulnerabilities?

@MartinNowak
Member Author

It now behaves as .front and .popFront do for arrays:

no flags                : AssertError with message
-release                : RangeError
-release -noboundscheck : undefined behavior

Note: This only applies to the element at index.
Every further decoding is explicitly enforced, because there
are no means to check it beforehand.

@CyberShadow
Member

Strings are present in most forms of user input, so I think we should be extra-careful here. It's not unimaginable that a programmer could disable bounds checking as an overzealous optimization.

When you say "undefined behavior", what's the worst that could happen (going from no-side-effects to remote code execution)?

@CyberShadow
Member

Another issue: Do I understand correctly that the changes make the code throw Error-derived classes?

Error-derived classes indicate unrecoverable errors. Validation failure of an invalid UTF-8 string should not be an unrecoverable error. Catching errors and throwing exceptions seems nefarious as well...

@MartinNowak
Member Author

No, they are still UTFExceptions.
To summarize again, the function behaves exactly as the old one. Especially in the presence of corrupt utf sequences or too short input.

The only difference is accessing the first code unit with index >= str.length.
This will now behave the same as an unchecked array .front access, that is,
you will get an AssertError, a RangeError, or unchecked memory access,
depending on which checks are disabled.

auto str = "Hello";
size_t index = 20;
std.utf.decode(str, index);

@CyberShadow
Member

OK, thanks for the clarification :)

@@ -549,92 +550,103 @@ size_t toUTFindex(in dchar[] str, size_t n) @safe pure nothrow
$(D UTFException) if $(D str[index]) is not the start of a valid UTF
sequence.
+/
dchar decode(in char[] str, ref size_t index) @safe pure
dchar decode(R:const(char[]))(R str, ref size_t index) @trusted pure
Member


I'd argue that this should be something like

dchar decode(C)(C[] str, ref size_t index) @trusted pure
    if(is(Unqual!C == char))

It's far more typical in Phobos to use template constraints. I know that Andrei considers that to generally be a better choice, and I agree with him.

Member Author


D'accord.

It needs to test for implicit conversion to what
previously was the overloaded parameter or would break code.

dchar decode(S)(S str, ref size_t index) @trusted pure
    if(is(S : const(char[])))

Member


No, what I suggested works just fine. C[], where is(Unqual!C == char) is true, will work with any array of char. You can pass const char[] to const(char)[] just fine, because the original array is unaltered (since it isn't passed into the function - a slice of it is) and the elements are still const.

Member Author


Unfortunately it does not, which is why I went with specialization in the first place.
For string enums the array element deduction will fail.

enum XYZ : string { a = "foo" };
size_t i;
decode(XYZ.a, i);

XYZ can't be matched with C[].

Probably the compiler could do better here, but currently it would add a breaking change.

Member


Hmmm. Well, then it should probably do what some of std.string does, which would be something like

dchar decode(C)(const(C)[] str, ref size_t index) @trusted pure
    if(is(Unqual!C == char))

And actually, it arguably should be doing something like that anyway, since in general, it's better to make parameters const when they can be. With in, it was const, but with your changes, it's not. It works with both const and non-const, but it's not const for both. And even if that doesn't work for some reason, it's probably still better to add const to what you have.

Member Author


That won't work either, the first parameter is of type XYZ. The compiler can't deduce an array element type
from that. I've added a qualifier though.

dchar decode(S)(in S str, ref size_t index) @trusted pure
    if(is(S : const(char[])))

Member


If it can't, then it should probably be reported as a bug, but what you have works, and if the compiler won't let it work the other way, then it won't let it work the other way - bug or not. So, it's fine as-is then.

@jmdavis
Member

jmdavis commented Nov 7, 2011

You're going to need to rebase this before it gets committed. It's fine as it is for review purposes, but there are too many small commits for it to be merged in as-is.

@braddr
Member

braddr commented Nov 7, 2011

jmdavis: I strongly disagree. I much prefer to see small commits that are trivial to understand and trivial to review.

@jmdavis
Member

jmdavis commented Nov 7, 2011

I'm not saying that all pull requests should be a single commit or anything like that. I'm saying that you shouldn't have a bunch of commits that change only a few lines - especially when it's only one line. Also, it's not infrequent that there are suggestions in reviews which result in changes that would be cleaner if the commits were rearranged so that the change is in the original commit. Take the "rename exception factory" commit for instance. If you change the original commit which introduced that function so that it was named exception instead of error, it's just as clear, and you have one fewer commit.

Commits should be broken up logically such that it's easier to understand which changes are being made - huge commits with tons of changes are problematic - but having a ton of small commits which could be merged together to result in cleaner commits which are just as understandable - if not more so - is not a good idea IMHO. In such cases, it's better to merge those commits.

@jmdavis
Member

jmdavis commented Nov 7, 2011

As far as I can see, this is fine now, but I'm not going to merge it in at the moment, because the Phobos unit tests aren't currently building due to a recent commit, and I don't like to merge stuff in when we can't verify that it's not breaking something.

jmdavis added a commit that referenced this pull request Nov 7, 2011
UTF decoding optimizations
@jmdavis jmdavis merged commit f4616f8 into dlang:master Nov 7, 2011
@jmdavis
Member

jmdavis commented Nov 7, 2011

Merged.

marler8997 pushed a commit to marler8997/phobos that referenced this pull request Nov 10, 2019
Improve logging accuracy
merged-on-behalf-of: Sebastian Wilzbach <sebi.wilzbach@gmail.com>