Fix several bugs in reverse(::UTF8String), add full coverage tests #12646

ScottPJones · 2015-08-16T14:42:16Z

reverse on a UTF8String used the C function u8_reverse, which I discovered in testing has several bugs.

It doesn't detect running off the end of the string when there is a char > 0x80
It picks up garbage bytes depending on the lead character
It is not portable to any machine that requires alignment.

I have rewritten it in Julia, and added tests that fully cover the function.
I wanted to remove u8_reverse from src/support/utf8.c, however that function is used by flisp for the string.reverse function, even though that function is apparently never used anywhere in any of the .scm code I have found in Base.
I wonder if the unused string functions in flisp, that are depending on broken C code, can simply be removed and save some space.
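For readers following along, here is a minimal sketch in C of what a bounds-checked reversal can look like. It is not the code from this PR (which is written in Julia) and not the original u8_reverse; the names seq_len and utf8_reverse_checked are hypothetical, and stopping a sequence early at a non-continuation byte is an assumption about how malformed input might be handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sequence length implied by a lead byte. ASCII, stray continuation
 * bytes, and invalid lead bytes are all treated as 1-byte units
 * (an assumption made for this sketch). */
static size_t seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Reverse the characters of src (n bytes) into dest (at least n bytes,
 * not overlapping src). Never reads or writes outside [0, n). */
void utf8_reverse_checked(char *dest, const char *src, size_t n)
{
    assert(n == 0 || (dest != NULL && src != NULL));
    size_t si = 0;   /* next unread byte in src */
    size_t di = n;   /* one past the next write position in dest */
    while (si < n) {
        size_t len = seq_len((unsigned char)src[si]);
        if (len > n - si)
            len = n - si;            /* truncated sequence: clamp to the end */
        /* Count only the bytes that really are continuation bytes. */
        size_t k = 1;
        while (k < len && ((unsigned char)src[si + k] & 0xC0) == 0x80)
            k++;
        di -= k;
        memcpy(dest + di, src + si, k);  /* keep the byte order within a character */
        si += k;
    }
}
```

Because every consumed byte is written exactly once, the output always has the same length as the input, even when the input ends in the middle of a multi-byte sequence.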

ScottPJones · 2015-08-16T17:28:39Z

@JeffBezanson might want to comment about the issues with wanting to remove unused functions from utf8.c and from flisp.

stevengj · 2015-08-17T20:35:28Z

How does the performance compare to the C version?

stevengj · 2015-08-17T20:41:23Z

string.reverse is not standard Scheme. (As far as I can tell, there is no string-reversal function in R7RS; Guile has string-reverse.) So, it is probably reasonable to remove it and the corresponding C code.

ScottPJones · 2015-08-18T01:38:15Z

@stevengj I haven't had time to do hard-core benchmarking yet (and got hammered when I mentioned performance as opposed to bug fixes as a goal not that long ago!). So far looks OK, but I've got to test all cases. I'm not worried though about performance, I've learned that I can make Julia generally as fast as C with a small amount of effort! 😀

stevengj · 2015-08-18T01:44:46Z

@ScottPJones, just a couple of typical numbers for random strings would be helpful.

ScottPJones · 2015-08-18T02:36:16Z

@stevengj I will, I just have some paid work to get done tonight, if I hope to get some sleep!

stevengj · 2015-08-18T15:17:51Z

Note the test failure:

LoadError("C:\\projects\\julia\\test\\cmdlineargs.jl",3,ErrorException("test failed: startswith(readall(@cmd(\"\\\$exename -h\")),\"julia [options] [program] [args...]\")\n in expression: startswith(readall(@cmd(\"\\\$exename -h\")),\"julia [options] [program] [args...]\")"))
 in error at error.jl:21
 in remotecall_fetch at multi.jl:728
while loading C:\projects\julia\test\runtests.jl, in expression starting on line 13
WARNING: Forcibly interrupting busy workers From worker 3:       * cmdlineargs

from

@test startswith(readall(`$exename -h`), "julia [options] [program] [args...]")

in cmdlineargs.jl.

jakebolewski · 2015-08-18T15:21:48Z

@ScottPJones that was my fault. You just need to rebase this PR on the latest master.

ScottPJones · 2015-08-19T01:12:08Z

@stevengj This is mostly faster than the C code, or the same speed, when dealing with strings of mostly ASCII or Latin1 characters; it's slower when dealing with characters that take 4 bytes in UTF-8 (> 0xffff). I put both the testing routine and results from my laptop in this gist: https://gist.github.com/ScottPJones/8feed7aa12f4ab25e76b
The C code is faster in those situations because it doesn't correctly check if it runs past the end of the string, and it uses *(uint32_t *) to copy all 4 bytes at a time (even though that is not really portable, some machines require alignment, or a __unaligned keyword to tell the compiler to use a different instruction), so I'm not so concerned about that.
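To illustrate the alignment point, here is a small sketch in C of copying a 4-byte sequence without casting to uint32_t *. The helper names copy4_unaligned and copy_seq are hypothetical and do not appear in utf8.c or in this PR. Going through memcpy is a common portable idiom; whether a given compiler turns it into a single load and store is an assumption that depends on the target.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy 4 bytes without assuming that src or dest is 4-byte aligned.
 * Dereferencing a misaligned uint32_t pointer is what makes the cast
 * approach non-portable; memcpy has no alignment requirement. */
static void copy4_unaligned(char *dest, const char *src)
{
    uint32_t word;
    memcpy(&word, src, sizeof word);
    memcpy(dest, &word, sizeof word);
}

/* Copy one UTF-8 sequence of len bytes (1..4), never touching more than
 * `remaining` bytes. Returns the number of bytes actually copied. */
size_t copy_seq(char *dest, const char *src, size_t len, size_t remaining)
{
    assert(len >= 1 && len <= 4);
    if (len > remaining)
        len = remaining;             /* do not run past the end of the data */
    if (len == 4) {
        copy4_unaligned(dest, src);  /* whole-word path, safe for any alignment */
    } else {
        for (size_t i = 0; i < len; i++)
            dest[i] = src[i];        /* byte-by-byte path for short sequences */
    }
    return len;
}
```

The 4-byte path is only taken when four readable bytes are known to remain, so the speed of a word-sized copy does not come at the cost of reading past the end.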

ScottPJones · 2015-08-19T01:13:57Z

@jakebolewski No problem! I had to rebase anyway after I ran full benchmarks and saw I needed to improve the performance a bit more.

stevengj · 2015-08-19T16:11:17Z

@ScottPJones, looks great! The slight performance penalty for safety looks perfectly fine here.

stevengj · 2015-08-19T16:12:09Z

I think you should go ahead and delete the old C code and the corresponding flisp code in this PR. No point in keeping buggy unused code in the tree just to implement a non-standard unused function in Scheme.

ScottPJones · 2015-08-19T18:17:32Z

Yes, I was planning on doing that as soon as this gets merged, thanks!

ihnorton · 2015-08-19T18:25:29Z

You can do the removal as another commit on this branch. (also need to squash the three existing commits)

stevengj · 2015-08-19T18:45:02Z

It would be cleaner to have a single PR.

ScottPJones · 2015-08-19T18:49:06Z

OK, other people have asked me to do focused single issue PRs, and I would think removing something from flisp is really separate from this.
About the squashing, others have also asked me to have things logically separate, i.e. bug fixing in one commit, new tests in another (in the same PR). (I've also been watching how Kate does it, with a number of logically distinct commits in a PR).

ScottPJones · 2015-08-19T18:51:15Z

(if you really want it as a single PR, and only 2 commits, I'll do so, I just want to be sure what to do, in the face of conflicting requests at different times)

stevengj · 2015-08-19T20:03:16Z

@ScottPJones, if your PR replaces X with Y, then removing X is a part of the same issue.

stevengj · 2015-08-19T20:05:54Z

The main reason to separate bugfix commits is so that they can be backported, but that's not going to happen with this commit (it depends on too many other changes that have been made in 0.4). In this case, I would tend to squash everything into a single commit because they are all a logical unit ("replace X with Y and test"). But I don't think it's a big deal as-is.

ivarne · 2015-08-19T21:32:12Z

How to partition a set of changes into commits and pull requests is mostly a stylistic issue, but for some workflows (e.g. git bisect, git cherry-pick and git revert), it makes an important difference. The goal is usually to make it easy to review, and different people have slightly different preferences. Key aspects of reviewability are size (smaller is better), single concern (independent of other changes) and completeness. Some seemingly contradictory advice is inevitable with different reviewers, especially when they put different values on each of the aspects and you don't understand the reasoning behind their requests.

ScottPJones · 2015-08-19T21:45:47Z

"if your PR replaces X with Y, then removing X is a part of the same issue."

If X depends on A, and Z depends on A, and a PR rewrites X to no longer depend on A,
I don't think that necessarily implies that A should be removed in the same PR.

That is why I really think that the removal (which might involve some bikeshedding about removing functionality from flisp, broken and unused or not) should be in a separate PR, and not hold this up.

About squashing, Stefan had praised me the last time for having separate commits for doing what Kate normally does, i.e. separate commits in a PR for the bug fixing vs. the testing, and I also just this last week got told I made a mistake by putting bug fix / test in the same PR.
I'll squash the first and third commit together, but I think it should stay fix commit + test commit.

(I hope I'm not being too annoying about this! I don't want to waste anybody's time, including my own)

ScottPJones · 2015-08-20T01:42:06Z

Note: I have the removal of u8_reverse and string.reverse waiting in a branch, https://github.com/ScottPJones/julia/tree/spj/remu8reverse.
I just haven't made a PR out of it yet, because it needs this merged first.

hayd · 2015-08-20T16:46:57Z

A bug-fix and a test (that exposes that bug) should always be in the same commit. If you incorrectly resolve a merge conflict (assuming it's incorrect on src), the tests for that commit should fail. A good rule of thumb: if you can't switch the order of two commits in a PR they should be in the same commit.

ScottPJones · 2015-08-20T17:16:40Z

OK, that makes sense (although I've seen a lot of exceptions to that being merged in).
I'll rebase now.

Improve performance of reverse(str::UTF8String) Fix speedup Add tests for reverse of UTF8String

ScottPJones · 2015-08-20T17:19:57Z

@hayd Is that rule of thumb documented somewhere? I find it very useful, thanks for taking the time.

JeffBezanson · 2015-08-20T19:06:05Z

The intent of the original C code was to get good performance assuming the string is valid. IIRC the comment at the top of the file discusses this.

ScottPJones · 2015-08-20T19:28:12Z

The Julia code I wrote is frequently the same speed or faster (only issue is with lots of 3 or 4 byte UTF-8 characters).
Yes, I could have fixed the C code, however, last time I wanted to do that, I was told in no uncertain terms that C code wasn't really wanted.
If strings were actually validated, then a lot of things could be much faster.
The checks I added were to simply keep it from running over the end of the data, which could possibly lead to access violations.

JeffBezanson · 2015-08-20T20:24:54Z

I'm ok with this change, just explaining why things were that way.

ScottPJones · 2015-08-20T22:16:50Z

OK, thanks! Is there any way to efficiently do the same sort of type-punning in Julia as you had done in C? (while keeping the sanity checks to keep it from going past the end, and having a fallback for platforms that don't allow unaligned access) Maybe using the Ptr type?

ScottPJones · 2015-08-21T14:11:25Z

Anything more needed here, or can it be merged? Thanks!

IainNZ added the domain:unicode label (Related to unicode characters and encodings) Aug 16, 2015

kshyatt added the test label (This change adds or pertains to unit tests) Aug 17, 2015

ScottPJones force-pushed the spj/u8reverse branch from 5393344 to f15806e on August 18, 2015 04:13

ScottPJones force-pushed the spj/u8reverse branch from f15806e to 8d8d802 on August 18, 2015 22:32

ScottPJones force-pushed the spj/u8reverse branch from 8d8d802 to dac0ba1 on August 19, 2015 22:03

Fix bugs in reverse of UTF8String

b20d87e

Improve performance of reverse(str::UTF8String) Fix speedup Add tests for reverse of UTF8String

ScottPJones force-pushed the spj/u8reverse branch from dac0ba1 to b20d87e on August 20, 2015 17:18

JeffBezanson added a commit that referenced this pull request Aug 21, 2015

Merge pull request #12646 from ScottPJones/spj/u8reverse

4acdff2

Fix several bugs in reverse(::UTF8String), add full coverage tests

JeffBezanson merged commit 4acdff2 into JuliaLang:master Aug 21, 2015

ScottPJones deleted the spj/u8reverse branch August 21, 2015 15:05
