
Fix Unicode bugs with UTF-16/UTF-32 conversions (#10959) #11004

Closed
wants to merge 3 commits

Conversation

ScottPJones
Contributor

Rewrote a number of the conversions between ASCIIString, UTF8String, UTF16String, and UTF32String, and also rewrote length() for UTF16String.

Added validation functions that can be used in many other places. These also fix a number of Unicode handling bugs: failing to detect overlong encodings, out-of-range characters, singleton surrogate characters, and missing continuation bytes, among other problems.

Added over 150 lines of testing code to detect the above conversion problems.

Added (in a gist) code to show other conversion problems not yet fixed:
https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc

Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity checking did not adversely affect performance (in fact, performance was greatly improved).
https://gist.github.com/ScottPJones/79ed895f05f85f333d84

@ScottPJones
Contributor Author

@mschauer No reason other than force of habit... too much time jumping between languages where types were required (C, C++, Java) and a language that had no types at all (which also has for loops that are slower than while loops, because a for loop pushes a frame while a while loop does not).
Does it need to be changed?

@ScottPJones
Contributor Author

Help! How can I find out if this failed travis-ci build had anything to do with my changes?

@elextr

elextr commented Apr 25, 2015

It seems to have failed downloading the rmath tarball; probably not something you did, I think.

@mbauman
Sponsor Member

mbauman commented Apr 25, 2015

I've restarted that build. You can check the build log yourself to see if the failure seems related, but if you're still not sure it's best to ask.

@mschauer
Contributor

@ScottPJones It was just a style remark, using while indicates to the reader that the index might be incremented irregularly.

@ScottPJones
Contributor Author

@mschauer The other reason I used while instead of for (I did use for later on) was that my other loops modify the index within the loop (skipping over UTF-8 sequences or UTF-16 surrogate pairs).
I welcome any and all constructive criticism; I think listening to people who've already been around the block is one of the best ways to improve your programming... and I am such a newbie with regard to Julia! (I hope at some point to join the exalted ranks of the frequent contributors to Julia 😀 )

@ScottPJones
Contributor Author

I also have a change that makes length of a UTF-8 string about twice as fast, but I'm waiting to hear back from @JeffBezanson (it is his u8_charnum function I changed, code he wrote apparently 10 or more years ago) before making a pull request. (Also, I'm still figuring out how best to use git/GitHub to have multiple pull requests open at a time.)
About the changes, I have an issue with benchmarking... I suspect that Julia is eating part of my benchmark 😁 For ASCIIString and UTF32String, the results for length seem impossibly fast, and I suspect that Julia (or LLVM) is figuring out that the loop isn't really doing anything and optimizing away my test loop for i=1:n ; length(s) ; end entirely... I do want length inlined into the for loop if possible, but I want to prevent it from eliminating the loop altogether... Any help on how to benchmark a function like that?
Thanks everybody!

@pao
Member

pao commented Apr 25, 2015

I'm still figuring out how to best use git/github to be able to do multiple pull requests at a time

git checkout master
git checkout -b topic/other-unrelated-feature

You can then push that branch to your remote, and make a pull request from it, unrelated to the other branch and pull request.
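For example, to publish it (assuming origin points at your fork; branch name as above):

git push origin topic/other-unrelated-feature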

As to your other question, if you suspect the loop is getting optimized out incorrectly, there are two things to do:

  1. Check if that's actually true by looking at @code_native results. I suspect based on your background you're not too uncomfortable with assembly syntax.
  2. If it is true, make the optimization invalid by making the operation observable. You can use @code_native to verify.
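For instance, wrapping the benchmark loop in a function gives @code_native something to inspect (a hypothetical sketch, not from this thread):

function bench(s, n)
    for i in 1:n
        length(s)   # if this call vanishes from the native code, the loop was optimized away
    end
end
@code_native bench("hello", 1000)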

@ScottPJones
Contributor Author

@pao Again, sorry for the newbie question, but how would I make it observable in Julia (without greatly affecting the performance of what I want to measure)? Are there any good write-ups of best practices for benchmarking Julia, especially for benchmarking small things like this, where the for loop itself may take as much time as the operation being measured?

@pao
Member

pao commented Apr 25, 2015

I don't know of any good references or generalizable rules for defeating an optimizer.

It appears to be enough (at least with the current definition of length) to dead-store every iteration, and output the final result:

function tricksy(s,n)
    local l            # declared outside the loop so the final value escapes
    for i in 1:n
        l = length(s)  # a dead store on every iteration except the last
    end
    l                  # returning l makes the work observable to the optimizer
end
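Hypothetical usage (utf8 and @time as used elsewhere in this thread), timing the call so the loop is measured but not deleted:

s = utf8("some test string")
@time tricksy(s, 100000)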

@kmsquire
Member

Regarding the Int annotations: yes, they should probably be removed. They are rarely necessary in Julia, and it's better to match the existing style.

@ScottPJones
Contributor Author

@kmsquire Do I need to change them before they will get merged, or can I hold off until my next round of speedups? (which will probably be in a short time after some more benchmarking)

@kmsquire
Member

@ScottPJones, if you make the changes on the same branch and push to github, they'll appear here. This is the common method for updating pull requests.

In some cases, that may go on for a few rounds, so when things are ready, typically we ask that people squash the commits down to one commit and then force push to the same branch. This is to keep the commit history of the main repository somewhat clean, and also makes it easier to track down problems using git bisect (since intermediate commits with bugs won't make their way into the repo).
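For reference, a typical squash-and-force-push looks something like this (branch name hypothetical):

git rebase -i master                # mark all but the first commit as "squash"
git push -f origin my-topic-branch  # rewrite the branch on your fork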

@ScottPJones
Contributor Author

@kmsquire, I haven't been able to figure out just what I need to do to squash the commits... Help please?

@ScottPJones
Contributor Author

@pao, thanks for the suggestion... now I get the following (and this is with my improved length() for UTF8String): UTF-8 is about 7.7 times slower with just 16 characters, and about 3753 times slower with 64K...
(And that's not even a really large size... documents in Spanish, French, Russian, or Japanese, like magazines, books, etc., would be larger.)
I think it was @stevengj who wondered why I avoid using UTF-8 like the plague, except for I/O to external sources (such as a file, JSON, XML)... I hope this explains it!

Looping 100000 times

strAscii: sizeof=16 length=16
strUTF8:  sizeof=26  length=16
strUTF16: sizeof=36 length=16
strUTF32: sizeof=64 length=16

ASCII length
elapsed time: 0.002165507 seconds (0 bytes allocated)
UTF-8 length
elapsed time: 0.015955597 seconds (0 bytes allocated)
UTF-16 length
elapsed time: 0.061743115 seconds (0 bytes allocated)
UTF-32 length
elapsed time: 0.002081889 seconds (0 bytes allocated)

Looping 10000 times

strAscii: sizeof=65536 length=65536
strUTF8:  sizeof=106496  length=65536
strUTF16: sizeof=147456 length=65536
strUTF32: sizeof=262144 length=65536

ASCII length
elapsed time: 0.000127822 seconds (312 kB allocated)
UTF-8 length
elapsed time: 0.45041207 seconds (312 kB allocated)
UTF-16 length
elapsed time: 2.188840351 seconds (312 kB allocated)
UTF-32 length
elapsed time: 0.00012162 seconds (312 kB allocated)

julia> 0.01596/0.00208
7.673076923076923

julia> 0.4504/0.00012
3753.3333333333335

@catawbasam
Contributor

@ScottPJones, http://julia.readthedocs.org/en/latest/manual/packages/#making-changes-to-an-existing-package contains information that may be helpful (some is relevant to base as well as packages), including stuff relating to squash.

@stevengj
Member

@ScottPJones, can you give some profiling numbers from a real application in which the UTF8-UTF16 conversions etc. are a significant portion of the time? As I'm sure you recognize, there is a tradeoff in optimizing code like this: making the code much more complicated in order to make it faster also makes it harder to maintain in the future.

@ScottPJones
Contributor Author

@stevengj Not really, for a few reasons... I am no longer consulting for the company where I architected and wrote a large part of the national language and Unicode support, so I no longer have access to internal benchmarks from that company. Having been responsible for the core of a product used around the world for almost 29 years, I would say that I know quite well the tradeoffs between code complexity and maintainability... I would also say that having something that is simply too slow to be useful, so that people end up implementing their own solutions, doesn't really help. If something is O(n) on some of the most frequent operations instead of O(1), people aren't going to want to use it.

Back in 1995, I had to port the product to work in Japan... at the time, I was being pushed in the direction of using a multibyte character set, but two other implementations of the language/database with Japanese versions had done so, using SJIS and EUC, and they just didn't perform that well compared to the 8-bit versions... I had heard about Unicode (it was at version 1.0, just 16-bit, no surrogates), and I pushed for using that instead... keeping a boundary between the byte-code interpreter, the database, and the I/O modules, so that the core always used 16-bit characters (like Java, which was also designed before Unicode 2.0 added surrogates), with conversions done only when setting/getting data to/from the database or reading/writing to external devices.

That system has been in production since 1997, is used all over the world, and has been subjected to constant benchmarking, which has shown that the Unicode version of the platform performs within a few percent of the 8-bit version. Also, since the only UTF-16 <-> UTF-8 conversions happened when data was coming or going to a file or over the network, they never were a significant portion of the time (partly because I optimized the hell out of the conversion functions). For this current project, though, I believe we may be getting a lot of data in UTF-8 format (as JSON or XML)... so I am very concerned about the performance of those conversions.

@kmsquire
Member

Hi Scott, I think Steven was just asking for some evidence of your improvements on real code (as opposed to the performance tests you've been posting).

if len == 0
    buf = Array{UInt16,1}(1)
    buf[1] = 0
    return UTF16String(buf)
Member

Actually, consider replacing this whole block with

return UTF16String(UInt16[0])

Contributor Author

Ah, good... I didn't know you could just say UInt16[0] to get an array with just the 0. I'll change those.

Sponsor Member

zeros(UInt16,1) is another alternative

@JeffBezanson
Sponsor Member

Saying "3753 times slower" is a bit misleading; it's just that one is O(1) and the other is O(n). I'd compare two O(n) operations, like searching for a character.

@kmsquire
Member

Re: squashing: while it's good to do (and good to know how to do), generally, it's better to wait until any changes have made their way into this pull request.

@JeffBezanson
Sponsor Member

I don't love the large amount of manual inlining in this code. I'd be really curious to see how close we can get with more generic code. Ideally, a conversion to UTFxx would loop over its input string and consume Chars, letting the compiler specialize it for different input string types.

In our current code, push! is a large source of slowness, and just eliminating that would help a lot.

There seems to be a full UTF-8 validation routine copy-pasted into each function that consumes UTF-8 (and similarly for the other types). I don't think that's acceptable. We need to find a better way to factor this.

Aside: UTF-16 sucks.
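For illustration, a minimal sketch (hypothetical, not this PR's code) of the generic shape described above: iterate the input as Chars so the compiler can specialize on the input type, with sizehint! to blunt the push! cost:

function generic_utf16(s::AbstractString)
    buf = UInt16[]
    sizehint!(buf, length(s) + 1)   # mitigate repeated push! reallocations
    for ch in s                     # consume Chars generically
        c = UInt32(ch)
        if c < 0x10000
            push!(buf, UInt16(c))
        else                        # encode as a surrogate pair
            c -= 0x10000
            push!(buf, UInt16(0xd800 + (c >> 10)))
            push!(buf, UInt16(0xdc00 + (c & 0x3ff)))
        end
    end
    push!(buf, 0x0000)              # UTF16String expects an explicit NUL terminator
    UTF16String(buf)
end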

@ScottPJones
Contributor Author

@JeffBezanson This was my first large bit of Julia code (before that, I'd been writing a lot of wrappers to my C libraries for packing/unpacking data, calling Aerospike, etc., nothing more than 5 lines), and I'd love to know how to do this better without sacrificing performance.
The code is not identical between the copies in utf8_to_utf16 and utf8_to_utf32, though, because the number of "extra" characters it calculates is different: when converting UTF-8 to UTF-16, a surrogate pair goes from 4 bytes -> 2 code units (or 6 bytes -> 2 code units if each half of the pair was encoded separately), but with UTF-8 to UTF-32, you go from 4 or 6 bytes -> 1 character.
I had tried to have a separate function, but it slowed things down significantly; I was trying to return multiple values instead of one (i.e. separate counts for 2- or 3-byte, 4-byte, and 6-byte sequences), so that the caller could then calculate the number of characters needed in the result... If you return a tuple and assign it to 3 separate local variables, does Julia end up creating an object and then immediately have to garbage collect it? If so, that's probably what killed the performance of having a single function... I'm sure there's a better way in Julia, but I'm just learning, and I'd love any suggestions from the experts...

Yes, UTF-16 sucks, but so does UTF-8 in a lot of ways... try convincing Japanese customers that UTF-8 is better than SJIS encoding for their data when it takes 50-100% more space... (To keep them happy, I had to come up with a very fancy way of encoding Unicode data, which frequently took < 1 byte per character on their data and yet works even on very short strings of just a few characters.)

@ScottPJones
Contributor Author

@JeffBezanson I disagree that it is misleading; I was just comparing length there, and the whole point is that it is O(n) instead of O(1)... I could not change the customers' applications, and the most common operations were: getting the length of a string, searching for a single character, getting the nth substring delimited by a single character, replacing the nth substring delimited by a single character, extracting a single character at position x, extracting a substring from x to y, and replacing a substring from x to y.

For one particular set of customers (everybody using the implementation of Pick/MultiValue), where the Pick language uses "system delimiters" in the range 248-255, all of those incredibly frequent delimiters become 2 bytes in UTF-8, which means I'd have had to search for 2 bytes instead of one all the time, with constant matches on the first byte... I don't have the numbers anymore, but there was a big difference in speed there.

@ScottPJones
Contributor Author

@JeffBezanson About making things more generic, I tried to do that at first, and then ran into all sorts of headaches because of the inconsistencies between UTF-8 (UInt8, but the \0 terminator is "hidden"), UTF-16 (UInt16, but you must add an explicit \0), and UTF-32 (Char, not UInt32, need explicit \0 terminator, need to do things like Char(0) instead of 0, and reinterpret(UInt32,s[i])).
Now do you see why I started to complain a few days ago about those inconsistencies in the internal representations?
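To make those inconsistencies concrete (constructors as used elsewhere in this PR; exact 0.4-era APIs assumed):

u16 = UTF16String(UInt16[0x0068, 0x0069, 0x0000])   # UInt16 data, explicit NUL needed
u32 = UTF32String(Char['h', 'i', Char(0)])          # Char data, terminator must be Char(0)
reinterpret(UInt32, u32[1])                         # getting a UInt32 out requires reinterpret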

@ScottPJones
Contributor Author

@kmsquire Sorry - the fact that I don't have any "real code" using the conversions yet, nor access to the code & benchmarks from before, is a sore point with me! Also, any "real code" that I'm working on using the conversions would be in confidential, proprietary code for the startup I'm consulting for...
I'd prefer to get the performance of Julia up to acceptable levels for the string operations that I need, and make that available to all users of Julia, rather than just do all of the conversions in my own C libraries.
(I actually might end up reworking this to do all of it in C anyway... [should I add the functions to utf8.c and rename it simply utf.c?] I tried the length speedup technique for UTF-8 in Julia first, but it was still slower than calling the C code.)
If I knew of Julia applications that were already doing a lot of conversions (say, to talk to a database, or to Java), then I'd love to see some benchmarking of those with my changes.

@ScottPJones
Contributor Author

@JeffBezanson If I add something to julia/src/support, i.e. utf.h and utf.c, for UTF-16 and UTF-32 optimizations, should I put something like @license This code is released under the MIT license for use as part of Julia?
I noticed your code in utf8.c states "placed in the public domain Fall 2005".
Also, why is there almost no documentation in the src code?
It would be helpful for future programmers to have all of that documented, especially with the markers for using doxygen...

@tkelman
Contributor

tkelman commented Apr 26, 2015

If I add something to julia/src/support, i.e. utf.h and utf.c, for UTF-16 and UTF-32 optimizations, should I put something like @license This code is released under the MIT license for use as part of Julia?

No, that's the default for contributions to this repository. Some of the code from src/support and src/flisp was reused from earlier projects that weren't all MIT-licensed.

Also, why is there almost no documentation in the src code?

Known bug. Major point of contention between those who wrote most of that code vs those trying to understand it. Documentation/comments should be kept up to date, which means more work than just writing the code. Some pieces are starting to be documented separately from the code in devdocs, some prefer external longer-form design documents rather than inline comments. I personally like the locality of comments, but either way if more people volunteer to write, read, and keep documentation of any form up-to-date it would be welcomed.

elseif ch < 0x800
num2byte += 1
flags |= UTF_UNICODE2
elseif !is_surrogate_codepoint(ch)
Member

As I said, codepoint here is probably incorrect AFAICT. I don't think you replied to that particular comment of mine, but isn't codeunit more correct?

Contributor Author

Yes, you are quite right, just brain rot mixing those up... I'll fix.

@StefanKarpinski
Sponsor Member

I'm closing this pull request as unreviewable. Please factor into smaller, reviewable pieces.

@hayd
Member

hayd commented Jun 2, 2015

@StefanKarpinski Do you mean the diff is unreviewable, or this PR page? Are you asking @ScottPJones to split it up into multiple commits, or to rethink/rework it completely?

@ScottPJones
Contributor Author

Really???
This seems downright spiteful.
26 people have been reviewing this for over a month, and a number of people have already said that it is ready to be merged in. In fact, this is probably the most reviewed PR in the history of Julia (at least, that's what GitHub's stats seem to point to), and I think all of the top contributors have weighed in on it.
Not the sort of action that encourages people to want to help out with improving Julia, not at all.

@ScottPJones
Contributor Author

yes. ping at the bottom when you've addressed the minor notes mentioned and i agree this is ready to merge.

@vtjnash Thanks for all the helpful review, I had addressed those minor notes and pushed the changes to my fork, but because this has been closed, they don't seem to show up here.

@ScottPJones
Contributor Author

So, here's a toast 🍷 to both the code-author and all the reviewers of this issue! Nice job sticking with it, everyone. I'm a string noob so can't help in any meaningful way, but you all have my good wishes for a speedy and constructive conclusion!

@timholy I guess you spoke too soon, unfortunately. Thanks for all of your help reviewing this, along with everybody else.

@timholy
Sponsor Member

timholy commented Jun 2, 2015

Well, what's closed can be reopened. But honestly, sometimes a clean start is both better and easier. Heck, on #11242 I restarted twice (see #10691, #10911). I found it was easier to copy/paste my changes into a new file, and it gave me a chance to think things through from scratch, having learned a lot about julia's guts from previous failed attempts. Finally, the good parts of my earlier attempts were salvaged with, like, 30 minutes of work, while the carry-it-forward (i.e., new stuff) took a week. So, don't take this as a statement of the value that folks find in your code, but simply a way to start from a cleaner history and let all your work shine!

@tkelman
Contributor

tkelman commented Jun 2, 2015

There's actually a github limitation/bug that means you literally can't reopen a closed PR if you've made any pushes to the branch after it was closed.

But figuratively, I do think the core improvements here are close to ready, if we can find smaller incremental ways of getting them in piece by piece.

@ScottPJones
Contributor Author

As I said before, I don't think this particular change can be split up. Others who have reviewed the code seem to have agreed.

As @IainNZ said:

I'd rather take working code now and refactor it later to reduce complexity than leave broken code as it is. There seems to be an implied alternative of simpler & correct code, but if no one is going to step up to make that PR, then this is the next best thing. (Also, breaking this PR into smaller ones seems fairly non-trivial to me, even if it's generally desirable.)

@tkelman
Contributor

tkelman commented Jun 2, 2015

We can immediately add any of the tests that would pass with the implementation we have on master. Test additions are almost always easy, noncontroversial merges.

Then we look at the tests that are added here where master would fail. I bet, taken a few failing tests at a time, we can get to the same place.

@hayd
Member

hayd commented Jun 2, 2015

One way it could be reduced is to go back to the state before (with utf16.jl and utf32.jl) and apply the changes you've made to each function in the original place. This would reduce the diff considerably... it may be pretty tricky, but honestly it's not that big of a diff to actually do; really it's only a few hundred lines. :s Worth a quick try on another branch?
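Concretely, something like (branch name hypothetical):

git checkout -b utf-fixes-smaller master
# re-apply each function's changes in base/utf16.jl and base/utf32.jl in place
git diff --stat master    # confirm the diff stays small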

@ScottPJones
Contributor Author

Sorry, @tkelman, but I've already donated quite a bit of my time to trying to fix the many serious problems I've been finding in the Julia code (including the one you pointed me to yesterday).
I just don't see why I should be spending even more time dealing with it.
The code is broken now. It needs to be fixed, if people want it to be reliable.
The unit test system is not good at all for dealing with tests of things known to be broken, or for checking that changes move the success rate forward without regressions (as I've brought up elsewhere).

@tkelman
Contributor

tkelman commented Jun 3, 2015

And you've been successful in fixing quite a few of them, which is hugely appreciated. I hope you won't overreact to one (large and overly-drawn-out) PR getting closed. If I had known this was going to be the resolution, I would have closed it myself and said the same thing Stefan just said but 3 or 4 weeks sooner to avoid the mess, but hindsight's 20/20.

Needing a better unit test system IS A DIFFERENT ISSUE.

The beauty of git is if you don't have time to see this through, anyone else can make a copy of your branch (ScottPJones@21bbaef looks like the most recently pushed commit) and try cleaning it up themselves. Preserving authorship information wherever appropriate, of course.

@ScottPJones
Contributor Author

@tkelman My comment about the unit test system was because of your recommendation to go painfully bit by bit, adding back in the unit tests that don't fail (not too many of those, actually). It may be a different issue, but it was still on topic as a reply to you.

@ScottPJones
Contributor Author

I do think that @StefanKarpinski's way of acting is detrimental to the Julia community.
I can understand closing a PR if the author is not responsive to issues brought up during review (does anybody here believe that has been the case?), or if there has been no review activity over a long period of time (i.e. nobody cared enough about the PR to donate their valuable time to reviewing it; I dare say that has NOT been the case here either).
If he felt that it should be refactored in some way before merging, then he should have brought that up before simply closing it. Worse, by saying it was "unreviewable", he essentially called all the people who were actively reviewing it fools for wasting their time reviewing the unreviewable.
Again, not at all the sort of actions that help build a community.

@ScottPJones
Contributor Author

But honestly, sometimes a clean start is both better and easier. Heck, on #11242 I restarted twice (see #10691, #10911).

@timholy Yes, in general that is true, but I believe I've already done just that in this case... First I had a pure Julia version, then I tried to improve it using my C library, then people said it had to be pure Julia, so I rewrote it based on my C code, refactored, etc. etc.

@tkelman
Contributor

tkelman commented Jun 3, 2015

your recommendation to go painfully bit by bit

Not just mine. A significant amount of the feedback all along has been (paraphrasing) "this is too big to do all at once," and "this has fundamental issues due to too much repeated code" and you haven't addressed that. You don't need to add all the failing tests at once with some sophisticated test tracking system to refactor this code. If you don't have time to come up with a different way of fixing the same set of bugs, someone else will.

@ihnorton
Member

ihnorton commented Jun 3, 2015

As I said before, I don't think this particular change can be split up. Others who have reviewed the code seem to have agreed.

Here's a potential path forward: make a new PR with the validation functions and associated tests only. Those are self-contained, valuable, and seem review-able on their own (few hundred LoC max). This will take three minutes, wouldn't require writing any new code, and adding this code won't change any existing behavior. Once those are in, then subsequent PRs could be made to start using them.

@timholy
Sponsor Member

timholy commented Jun 3, 2015

Given that this PR has had a messier history than most, I totally agree that this is not the catastrophe you're making it out to be. If you think your PRs aren't getting merged quickly enough, just check out the history of @vtjnash's PRs. You have a very good track record of getting things in quickly because you push hard, but I think many in the community would be happier if you didn't push quite so hard 😄.

My advice: just treat this as a "process" decision and nothing personal, and resubmit. Because that's the reality, and it doesn't help anything or anyone to treat it otherwise.

@ScottPJones
Contributor Author

@timholy I'm not making it out to be a catastrophe, but I do think that things were not handled well at all.
I already have resubmitted - I hope this time, things go a bit more smoothly (they could hardly go any worse, now could they?!? 😀)

@timholy
Sponsor Member

timholy commented Jun 3, 2015

Presuming the resubmission included refactoring along lines suggested by others, that's truly excellent. Cheers!

@hayd
Member

hayd commented Jun 3, 2015

The new PR is #11551, and honestly it looks like a big improvement; smaller diff and more separate files/concerns. :)

Labels
domain:unicode Related to unicode characters and encodings