Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix several bugs in reverse(::UTF8String), add full coverage tests #12646

Merged
merged 1 commit into from Aug 21, 2015

Conversation

ScottPJones
Copy link
Contributor

reverse on a UTF8String used the C function u8_reverse, which I discovered in testing has several bugs.

  1. It doesn't detect running off the end of the string when there is a char > 0x80
  2. It picks up garbage bytes depending on the lead character
  3. It is not portable to any machine that requires alignment.

I have rewritten it in Julia, and added tests that fully cover the function.
I wanted to remove u8_reverse from src/support/utf8.c, however that function is used by flisp for the string.reverse function, even though that function is apparently never used anywhere in any of the .scm code I have found in Base.
I wonder if the unused string functions in flisp, that are depending on broken C code, can simply be removed and save some space.

@IainNZ IainNZ added the domain:unicode Related to unicode characters and encodings label Aug 16, 2015
@ScottPJones
Copy link
Contributor Author

@JeffBezanson might want to comment about the issues with wanting to remove unused functions from utf8.c and from flisp.

@kshyatt kshyatt added the test This change adds or pertains to unit tests label Aug 17, 2015
@stevengj
Copy link
Member

How does the performance compare to the C version?

@stevengj
Copy link
Member

string.reverse is not standard Scheme. (As far as I can tell, there is no string-reversal function in R7RS; Guile has string-reverse.) So, it is probably reasonable to remove it and the corresponding C code.

@ScottPJones
Copy link
Contributor Author

@stevengj I haven't had time to do hard-core benchmarking yet (and got hammered when I mentioned performance as opposed to bug fixes as a goal not that long ago!). So far looks OK, but I've got to test all cases. I'm not worried though about performance, I've learned that I can make Julia generally as fast as C with a small amount of effort! 😀

@stevengj
Copy link
Member

@ScottPJones, just a couple of typical numbers for random strings would be helpful.

@ScottPJones
Copy link
Contributor Author

@stevengj I will, I just have some paid work to get done tonight, if I hope to get some sleep!

@stevengj
Copy link
Member

Note the test failure:

LoadError("C:\\projects\\julia\\test\\cmdlineargs.jl",3,ErrorException("test failed: startswith(readall(@cmd(\"\\\$exename -h\")),\"julia [options] [program] [args...]\")\n in expression: startswith(readall(@cmd(\"\\\$exename -h\")),\"julia [options] [program] [args...]\")"))
 in error at error.jl:21
 in remotecall_fetch at multi.jl:728
while loading C:\projects\julia\test\runtests.jl, in expression starting on line 13
WARNING: Forcibly interrupting busy workers From worker 3:       * cmdlineargs           

from

@test startswith(readall(`$exename -h`), "julia [options] [program] [args...]")

in cmdlineargs.jl.

@jakebolewski
Copy link
Member

@ScottPJones that was my fault. You just need to rebase this PR on the latest master.

@ScottPJones
Copy link
Contributor Author

@stevengj This is mostly faster than the C code, or the same when dealing with strinsgs with mostly ASCII or Latin1 characters, it's slower when dealing with characters that take 4 bytes in UTF-8 (> 0xffff). I put both the testing routine and results from my laptop in this gist: https://gist.github.com/ScottPJones/8feed7aa12f4ab25e76b
The C code is faster in those situations because it doesn't correctly check if it runs past the end of the string, and it uses *(uint32_t *) to copy all 4 bytes at a time (even though that is not really portable, some machines require alignment, or a __unaligned keyword to tell the compiler to use a different instruction), so I'm not so concerned about that.

@ScottPJones
Copy link
Contributor Author

@jakebolewski No problem! I had to rebase anyway after I ran full benchmarks and saw I needed to improve the performance a bit more.

@stevengj
Copy link
Member

@ScottPJones, looks great! The slight performance penalty for safety looks perfectly fine here.

@stevengj
Copy link
Member

I think you should go ahead and delete the old C code and the corresponding flisp code in this PR. No point in keeping buggy unused code in the tree just to implement a non-standard unused function in Scheme.

@ScottPJones
Copy link
Contributor Author

Yes, I was planning on doing that as soon as this gets merged, thanks!

@ihnorton
Copy link
Member

You can do the removal as another commit on this branch. (also need to squash the three existing commits)

@stevengj
Copy link
Member

It would be cleaner to have a single PR.

@ScottPJones
Copy link
Contributor Author

OK, other people have asked me to do focused single issue PRs, and I would think removing something from flisp is really separate from this.
About the squashing, others have also asked me to have things logically separate, i.e. bug fixing in one commit, new tests in another (in the same PR). (I've also been watching how Kate does it, with a number of logically distinct commits in a PR).

@ScottPJones
Copy link
Contributor Author

(if you really want it as a single PR, and only 2 commits, I'll do so, I just want to be sure what to do, in the face of conflicting requests at different times)

@stevengj
Copy link
Member

@ScottPJones, if your PR replaces X with Y, then removing X is a part of the same issue.

@stevengj
Copy link
Member

The main reason to separate bugfix commits is so that they can be backported, but that's not going to happen with this commit (it depends on too many other changes that have been made in 0.4). In this case, I would tend to squash everything into a single commit because they are all a logical unit ("replace X with Y and test"). But I don't think it's a big deal as-is.

@ivarne
Copy link
Sponsor Member

ivarne commented Aug 19, 2015

How to partition a set of changes into commits and pull requests is a mostly stylistic issue, but for some workflows (eg. git bisect, git cherry-pick and git revert), it makes an important difference. The goal is usually to make it easy to review, and different people have slightly different preferences. Key aspects in reviewability is size (smaller is better), single concern (independent of other changes) and completeness. Some seemingly contradictory advice is inevitable with different reviewers, especially when they put different values on each of the aspects and you don't understand the reasoning behind their requests.

@ScottPJones
Copy link
Contributor Author

if your PR replaces X with Y, then removing X is a part of the same issue.

If X depends on A, and Z depends on A, and a PR rewrites X to no longer depend on A,
I don't think that necessarily implies that A should be removed in the same PR.

That is why I really think that that (which might have some bikeshedding about removing functionality from flisp, broken and unused or not), should be in a separate PR, and not hold this up.

About squashing, Stefan had praised me the last time for having separate commits for doing what Kate normally does, i.e. separate commits in a PR for the bug fixing vs. the testing, and I also just this last week got told I made a mistake by putting bug fix / test in the same PR.
I'll squash the first and third commit together, but I think it should stay fix commit + test commit.

(I hope I'm not being too annoying about this! I don't want to waste anybody's time, including my own)

@ScottPJones
Copy link
Contributor Author

Note: I have the removal of u8_reverse and string.reverse waiting in a branch, https://github.com/ScottPJones/julia/tree/spj/remu8reverse.
I just haven't made a PR out of it yet, because it needs this merged first.

@hayd
Copy link
Member

hayd commented Aug 20, 2015

A bug-fix and a test (that exposes that bug) should always be in the same commit. If you incorrectly resolve a merge conflict (assuming it's incorrect on src), the tests for that commit should fail. A good rule of thumb: if you can't switch the order of two commits in a PR they should be in the same commit.

@ScottPJones
Copy link
Contributor Author

OK, that makes sense (although I've seen a lot of exceptions to that being merged in).
I'll rebase now.

Improve performance of reverse(str::UTF8String)

Fix speedup

Add tests for reverse of UTF8String
@ScottPJones
Copy link
Contributor Author

@hayd Is that rule of thumb documented somewhere? I find it very useful, thanks for taking the time.

@JeffBezanson
Copy link
Sponsor Member

The intent of the original C code was to get good performance assuming the string is valid. IIRC the comment at the top of the file discusses this.

@ScottPJones
Copy link
Contributor Author

The Julia code I wrote is frequently the same speed or faster (only issue is with lots of 3 or 4 byte UTF-8 characters).
Yes, I could have fixed the C code, however, last time I wanted to do that, I was told in no uncertain terms that C code wasn't really wanted.
If strings were actually validated, then a lot of things could be much faster.
The checks I added were to simply keep it from running over the end of the data, which could possibly lead to access violations.

@JeffBezanson
Copy link
Sponsor Member

I'm ok with this change, just explaining why things were that way.

@ScottPJones
Copy link
Contributor Author

OK, thanks! Is there any way to efficiently do the some sort of type-punning as you had done in C, in Julia? (while keeping the sanity checks to keep it from going past the end, and having a fallback for platforms that don't allow unaligned access) Maybe using the Ptr type?

@ScottPJones
Copy link
Contributor Author

Anything more needed here, or can it be merged? Thanks!

JeffBezanson added a commit that referenced this pull request Aug 21, 2015
Fix several bugs in reverse(::UTF8String), add full coverage tests
@JeffBezanson JeffBezanson merged commit 4acdff2 into JuliaLang:master Aug 21, 2015
@ScottPJones ScottPJones deleted the spj/u8reverse branch August 21, 2015 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:unicode Related to unicode characters and encodings test This change adds or pertains to unit tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants