UTF decoding optimizations #299
Conversation
dchar decode(R:const(char[]))(R str, ref size_t index) @trusted pure
in
{
    assert(index < str.length);
It's a nitpick, but some kind of message should be here, right? Basically the same as in std.range/std.array.

Nice job here. Other than these tiny message nits, it looks great.
It would be orders of magnitude faster if you got rid of those exceptions. It's nice to be able to continue decoding and replace invalid code points with something useful on screen (e.g. Scintilla shows hex bytes for invalid code points), but with exceptions this function is too slow to be useful. In my own implementation the
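As an illustration of that replace-and-continue strategy, here is a minimal C sketch (hypothetical, not the Phobos or Scintilla code, and covering only 1- and 2-byte sequences): any malformed byte is consumed and reported as U+FFFD so decoding can keep going instead of throwing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: decode one code point from s[*index .. len);
 * on malformed input, consume a single byte and yield U+FFFD so the
 * caller can continue decoding. Only 1- and 2-byte sequences here. */
static uint32_t decode_replace(const uint8_t *s, size_t len, size_t *index)
{
    uint8_t c = s[(*index)++];
    if (c < 0x80)
        return c;                               /* ASCII fast path */
    if (c >= 0xC2 && c <= 0xDF && *index < len &&
        (s[*index] & 0xC0) == 0x80)             /* well-formed 2-byte seq */
        return ((uint32_t)(c & 0x1F) << 6) | (s[(*index)++] & 0x3Fu);
    return 0xFFFD;  /* invalid: consume one byte and keep decoding */
}
```

A loop calling this never aborts mid-stream, which is the property the comment is after.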
I don't get your point. Exceptions are no-ops in the no-error case, just a bunch of conditional jumps.
You're right about there being lots of invalid code points. I was testing by reading this file: With exceptions I need ~150 msecs to parse the file and skip invalid code points with the old decode function. With my own decode function the average time is 2 msecs. The diff between the two functions is that my decode's header is
If I use

A different issue I have is that

Anyway, I'll try to make a small platform-independent test case later so I can confirm my findings. I don't want to block this pull since it could very well be my fault.
I've checked all 2^8 1-char, 2^16 2-char, 2^24 3-char and 2^32 4-char inputs against the old implementation. The only change, which is done by design, is that the enforcement of index < str.length is replaced by an assertion.

Overall I'm pretty confident that this introduces no regression.
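The brute-force shape of such an exhaustive check can be sketched as follows. This is a hypothetical C harness, not the actual run described above (which compared against the old Phobos implementation for 1- to 4-byte inputs); it compares two stand-in validity predicates for all 2^16 two-byte inputs and counts disagreements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Validity by lead-byte range: lead in C2..DF, trail in 80..BF. */
static bool valid2_by_range(uint8_t a, uint8_t b)
{
    return a >= 0xC2 && a <= 0xDF && (b & 0xC0) == 0x80;
}

/* Validity by decoding: structural bits OK and the decoded code
 * point is >= 0x80, which rejects overlong forms such as C0 80. */
static bool valid2_by_codepoint(uint8_t a, uint8_t b)
{
    if ((a & 0xE0) != 0xC0 || (b & 0xC0) != 0x80)
        return false;
    uint32_t cp = ((uint32_t)(a & 0x1F) << 6) | (b & 0x3F);
    return cp >= 0x80;
}

/* Exhaustively compare both predicates; 0 means they are equivalent. */
int exhaustive_2byte_diff(void)
{
    int diff = 0;
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            if (valid2_by_range((uint8_t)a, (uint8_t)b) !=
                valid2_by_codepoint((uint8_t)a, (uint8_t)b))
                diff++;
    return diff;
}
```

The same loop scaled to 2^24 and 2^32 inputs gives complete coverage of 3- and 4-byte sequences.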
I didn't look too closely at the code, so this question might not make any sense at all: what happens if Phobos is rebuilt with -release (or the respective function is turned into a template and is built as part of a program with -release)? Is there a chance of segfaults / data corruption / security vulnerabilities?
It now behaves as .front and .popFront do for arrays.

- no flags: AssertError with message
- -release: undefined behavior

Note: This only applies to the element at index.
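A rough C analogy for these two modes (hypothetical, with assert/NDEBUG standing in for D's in-contract and -release): the precondition is an assertion, so a debug build stops with a message, while a release build compiles the check away and an out-of-range index becomes an out-of-bounds read, i.e. undefined behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical analogy of the decode in-contract: the bounds check is
 * an assertion, active in debug builds only. */
static char first_unit(const char *str, size_t len, size_t index)
{
    assert(index < len && "attempting to decode past the end of str");
    return str[index];  /* compiled with NDEBUG, index >= len is UB */
}
```

This is the same trade-off .front makes for arrays: the contract is on the caller, and release builds trust it.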
Strings are present in most forms of user input, so I think we should be extra careful here. It's not unimaginable that a programmer could disable bounds checking as an overzealous optimization. When you say "undefined behavior", what's the worst that could happen (going from no side effects to remote code execution)?
Another issue: do I understand correctly that the changes make the code throw Error-derived classes? Error-derived classes indicate unrecoverable errors, and validation failure on an invalid UTF-8 string should not be an unrecoverable error. Catching Errors and throwing Exceptions seems nefarious as well...
No, they are still UTFExceptions. The only difference is when accessing the first code unit with index >= str.length:

auto str = "Hello";
OK, thanks for the clarification :)
@@ -549,92 +550,103 @@ size_t toUTFindex(in dchar[] str, size_t n) @safe pure nothrow
     $(D UTFException) if $(D str[index]) is not the start of a valid UTF
     sequence.
   +/
-dchar decode(in char[] str, ref size_t index) @safe pure
+dchar decode(R:const(char[]))(R str, ref size_t index) @trusted pure
I'd argue that this should be something like
dchar decode(C)(C[] str, ref size_t index) @trusted pure
if(is(Unqual!C == char))
It's far more typical in Phobos to use template constraints. I know that Andrei considers that to generally be a better choice, and I agree with him.
D'accord.

It needs to test for implicit conversion to what previously was the overloaded parameter, or it would break code.

dchar decode(S)(S str, ref size_t index) @trusted pure
if(is(S : const(char[])))
No, what I suggested works just fine. C[] where is(Unqual!C == char) is true will work with any array of char. You can pass const char[] to const(char)[] just fine, because the original array is unaltered (since it isn't passed in to the function - a slice of it is) and the elements are still const.
Unfortunately it does not, which is why I went with specialization in the first place.
For string enums the array element deduction will fail.
enum XYZ : string { a = "foo" };
size_t i;
decode(XYZ.a, i);
XYZ can't be matched with C[].
Probably the compiler could do better here, but currently it would add a breaking change.
Hmmm. Well, then it should probably do what some of std.string does, which would be something like

dchar decode(C)(const(C)[] str, ref size_t index) @trusted pure
if(is(Unqual!C == char))

And actually, it arguably should be doing something like that anyway, since in general, it's better to make parameters const when they can be. With in, it was const, but with your changes, it's not. It works with both const and non-const, but it's not const for both. And even if that doesn't work for some reason, it's probably still better to add const to what you have.
That won't work either, the first parameter is of type XYZ. The compiler can't deduce an array element type from that. I've added a qualifier though.

dchar decode(S)(in S str, ref size_t index) @trusted pure
if(is(S : const(char[])))
If it can't, then it should probably be reported as a bug, but what you have works, and if the compiler won't let it work the other way, then it won't let it work the other way - bug or not. So, it's fine as-is then.
You're going to need to rebase this before it gets committed. It's fine as it is for review purposes, but there are too many small commits for it to be merged in as-is.
jmdavis: I strongly disagree. I much prefer to see small commits that are trivial to understand and trivial to review.
I'm not saying that all pull requests should be a single commit or anything like that. I'm saying that you shouldn't have a bunch of commits that change only a few lines - especially when it's only one line. Also, it's not infrequent that there are suggestions in reviews which result in changes which would just be cleaner if the commits were rearranged so that the change is in the original commit. Take the "rename exception factory" commit for instance. If you change the original commit which introduced that function so that it used the new name from the start, the separate rename commit becomes unnecessary.

Commits should be broken up logically such that it's easier to understand which changes are being made - huge commits with tons of changes are problematic - but having a ton of small commits which could be merged together to result in cleaner commits which are just as understandable - if not more so - is not a good idea IMHO. In such cases, it's better to merge those commits.
- Use fast path tests for non-complex unicode sequences that can be inlined. These rely on the built-in array bounds check.
- Factor out complex cases into separate functions that do exception based validity checks. The char[] and wchar[] versions use pointers to avoid redundant array bounds checks, thus they can only be @trusted.
- Complete rewrite of decode for char[] to use less branching and unrolled loops. This requires fewer registers AND fewer instructions. The overlong check is done much more cheaply on the code point.
- The decode functions were made templates to short circuit the very restricted function inlining possibilities.
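The fast-path/slow-path split described above can be sketched in C roughly like this (a hypothetical illustration, not the actual Phobos code): a tiny inlinable function handles ASCII, and the complex cases are factored into a separate function whose overlong/surrogate/range checks are done once, on the decoded code point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Out-of-line slow path: full validation of a multi-byte sequence.
 * Returns the code point, or 0xFFFFFFFF on invalid input. */
static uint32_t decode_slow(const uint8_t *s, size_t len, size_t *index);

/* Fast path: one compare and no call for ASCII, so the compiler can
 * inline it at every call site. */
static inline uint32_t decode_fast(const uint8_t *s, size_t len, size_t *index)
{
    uint8_t c = s[*index];            /* bounds are the caller's contract */
    if (c < 0x80) {
        (*index)++;
        return c;
    }
    return decode_slow(s, len, index);
}

static uint32_t decode_slow(const uint8_t *s, size_t len, size_t *index)
{
    uint8_t c = s[*index];
    size_t n = (c >= 0xF0) ? 4 : (c >= 0xE0) ? 3 : (c >= 0xC0) ? 2 : 0;
    if (n == 0 || *index + n > len)
        return 0xFFFFFFFFu;

    uint32_t cp = c & (0x7Fu >> n);   /* value bits of the lead byte */
    for (size_t i = 1; i < n; i++) {
        if ((s[*index + i] & 0xC0) != 0x80)
            return 0xFFFFFFFFu;
        cp = (cp << 6) | (s[*index + i] & 0x3Fu);
    }
    /* overlong/surrogate/range checks done once, on the code point */
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_cp[n] || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0xFFFFFFFFu;
    *index += n;
    return cp;
}
```

Checking `cp < min_cp[n]` after decoding replaces per-byte overlong tests with a single comparison, which is the "much cheaper on the code point" trick from the commit message.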
As far as I can see, this is fine now, but I'm not going to merge it in at the moment, because the Phobos unit tests aren't currently building due to a recent commit, and I don't like to merge stuff in when we can't verify that it's not breaking something. |
Merged. |
As a rough number, I get a 2x-4x speedup in streamed string decoding, even for unicode heavy input.

We should think of moving these functions to druntime to avoid the duplication.