Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #11141/#10973 and improve performance of is_valid_utf8/is_valid_ascii #11203

Merged
merged 3 commits into from
May 15, 2015

Conversation

ScottPJones
Copy link
Contributor

Rewrote u8_isvalid to fix cases of not detecting single or out of order surrogate characters.
Also improved performance, generally around 2-3x faster on simple tests with different types
of characters (ASCII, Latin1, 2-byte/3-byte/4-byte UTF-8 sequences).
Fixed some tests that were depending on being able to make invalid string literals.

@ScottPJones ScottPJones changed the title Spj/fixvalidutf8 Fix #11141 and improve performance of is_validate_utf8/is_validate_ascii May 8, 2015

length is in bytes, since without knowing whether the string is valid
it's hard to know how many characters there are! */
int u8_isvalid(const char *str, size_t length)
int u8_isvalid(const char *iStr, size_t iLength)
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's with these argument names? Is this Hungarian notation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the coding convention where I'm consulting... not Hungarian...
i for input, io for input/output, o for output, for parameters...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is using that convention a problem? (I dislike Hungarian myself, this was something that the other developers had wanted, just for the parameters, similar to Google's requirement of putting a "p" on formal parameter names...). I hadn't found any C/C++ style guide for Julia internals, just the guide for Julia code. Thanks!

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make sense to follow the style of the surrounding code, which is not to do this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I just copied this from my own UTF handling code, and changed the name to match Jeff's name, and removed the documentation :sad:
When I have all of u8 rewritten to run faster, can I use my standard conventions, since there will be no other surrounding code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@StefanKarpinski Because when I had copied code of mine previously into utf8.c, I was told to remove the documentation, because nobody had decided yet how code should be documented 😢

@tknopp I don't mean to sound snide, but, if I'd actually found any consistent conventions in the code I've been looking at, I would have followed it...
I've been working mostly in utf8.c, utf8proc.c, and gc.c. Each one has different conventions for how the code is formatted, [2 vs 4 spaces for indentation, for example, different ways of indenting {}'s, etc.], and different ways things are named. There are things like jlegal_func (is that a legal function? hmmm...) or should that really have been jl_egal_func, as were some of the other functions, such as jl_alloc_svec_func.

Having authored the coding / style / portability standard for a company with hundreds of programmers, that is still in use after 28 years, and co-authored the guide for the start-up I'm at now, I feel not having some sort of guide for C/C++ programming is a serious lack for Julia.
I've found that consistency, external and internal documentation, good readable style, as well as having portable code, is incredibly important when you have lots of contributors over large periods of time.
The code now is not very consistent, there doesn't seem to be any internal documentation or comments to explain whatever tricks or design / implementation decisions were made, use of too short variable names makes it hard to read parts of the code (using a single letter variable name, that is a substring of several other variable names in the same function, for example)...
lots of static variables that will make adding threading very difficult... I could go on, but it would just frustrate me more...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'd seen that paragraph, but it says nothing about variable naming... and isn't followed in the code I've seen.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are right about variable naming. I just wanted to reply to your complaint about formatting.

I can absolutely see some of your other points. For instance it would be a good move to put globals into structs so that there is a julia context that can be independently hold. But one has also to acknowledge that the C core of julia is moving fast and maintained/developed by a small group of people. Taking this into account development has been pretty efficient IMHO.

If you want to help improving the C core I suggest opening a new issue and making concrete proposals that the core hackers can judge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tknopp to be fair, I think @ScottPJones has gotten the message about opening concrete, independent, small issues by now. See #11123 for an issue about doc comments. On doc comments and variable naming I think it might be best to just start proposing small PR's to solely add doc comments or replace variable names with clearer, more consistent versions. They may stand out as not matching the surrounding code at first, but if a consistent style starts getting gradually applied to more of the C code then I think it could be fair to start holding newly added code to a standard (which we would add to CONTRIBUTING.md).

@ScottPJones ScottPJones changed the title Fix #11141 and improve performance of is_validate_utf8/is_validate_ascii Fix #11141/#10973 and improve performance of is_validate_utf8/is_validate_ascii May 8, 2015
@tkelman tkelman added the domain:unicode Related to unicode characters and encodings label May 9, 2015
@tkelman
Copy link
Contributor

tkelman commented May 9, 2015

+ a bunch for the level of comments here.

For indentation and purely cosmetic code formatting conventions I'd be in favor of adopting a "run clang-format on it" policy, ideally with formatting-only changes split into separate commits from functional changes. Variable naming is harder to enforce.

@ScottPJones
Copy link
Contributor Author

I did try to follow all of the formatting stuff I'd found (4 space indentation, no tab characters, no trailing spaces on a line, { } placement...). I've never used "clang-format", I'll have to investigate that... thanks!

@ScottPJones ScottPJones changed the title Fix #11141/#10973 and improve performance of is_validate_utf8/is_validate_ascii Fix #11141/#10973 and improve performance of is_valid_utf8/is_valid_ascii May 9, 2015
@ScottPJones
Copy link
Contributor Author

Anything left keeping this from being merged into base? Thanks!

@tkelman
Copy link
Contributor

tkelman commented May 13, 2015

I don't think the argument names matter enough to prevent this from being merged, but I don't know enough of the unicode details to be qualified to review the functional changes here. @StefanKarpinski or @stevengj or @jiahao can you take a look and hit the button if everything is in order?

@kmsquire
Copy link
Member

Any particular reason for the test changes? They look mostly cosmetic, and to me, the changes are not clearly better to me, but I might be missing something.

Also, when all edits are done, the commits should be squashed.

@ScottPJones
Copy link
Contributor Author

What do you mean? I added a very complete set of tests, with all of the edge conditions, and tests for UTF8String, which didn't have any before...

@tkelman
Copy link
Contributor

tkelman commented May 14, 2015

I think you might be mixing up the different PR's, here 88a8071 makes a few small changes to a test file - as described up top

Fixed some tests that were depending on being able to make invalid string literals.

@ScottPJones
Copy link
Contributor Author

The changes to test/strings.jl in 88a8071 were not cosmetic at all... they are necessary when the string literals no longer accept invalid UTF-8.

@tkelman
Copy link
Contributor

tkelman commented May 14, 2015

In this case, while I don't usually like having commits that fail tests present in the history of master, separating out the commits is actually good for license's sake. Note the comment based on the valid_utf8 routine from the PCRE library by Philip Hazel, about the old implementation which is being deleted here.

@ScottPJones
Copy link
Contributor Author

@tkelman It's too early for my brain... what would you like me to change?

@tkelman
Copy link
Contributor

tkelman commented May 14, 2015

Nothing from me yet, unless someone who knows unicode says otherwise. Was mostly responding to

Also, when all edits are done, the commits should be squashed.

where here it's actually useful to have the clean separation of new from-scratch version in a separate commit from deleting the old one.

ScottPJones added a commit to ScottPJones/julia that referenced this pull request May 14, 2015
@kmsquire
Copy link
Member

@ScottPJones, regarding the tests, thanks for the explanation--that's what I was missing.

@tkelman, re:squashing, I agree that separate commits are sometimes useful, and you make a good point.

Scott, can you squash only the first and third commits? You can do this with git rebase -i HEAD~3 on this branch, swapping the second and third commits, and then squashing the second into the first.

(For why broken commits can be an issue, look up "git bisect".)

I wound also suggest renaming the parameters to match the surrounding code, and consider clarifications to the coding conventions separately. (Ideally the rename would also be squashed into the first commit.)

(If you're concerned about screwing up git history, you could make a branch off of this branch first, and practice these changes there.)

Cheers!

@ScottPJones
Copy link
Contributor Author

OK, all done! Please let me know if I’ve somehow screwed this up again! 😀 Thanks everybody!

@stevengj
Copy link
Member

Looks good, but it seems that there are currently no test cases of is_valid_utf8 in test/strings.jl, and I wouldn't like to see changes to such a basic routine merged unless there are more extensive tests.

Maybe something like:

let tests = (("abcd".data,true), [("α".data[i:i],false) for i in 1:2]..., ("🐨".data[1:3],false), ("🐨".data, true), ("\ud800".data, false))
    for (d1,v1) in tests
        for (d2,v2) in tests
            for (d3,v3) in tests
                @test is_valid_utf8(vcat(d1,d2,d3)) == v1 && v2 && v3
            end
        end
    end
end

(and add any other edge cases as (data, isvalid) pairs to the tests variable)

@ScottPJones
Copy link
Contributor Author

OK, I put them in as part of the is_valid_* -> invalid(T change, I’ll rename them and put them in here as well. Your method is nice... I’ll have to change it a bit though (“\ud800” gives a error now with one of my changes, it is an invalid string that wasn’t caught before).

@jakebolewski
Copy link
Member

looks great

@stevengj
Copy link
Member

Just replace "\ud800".data with [0xed,0xa0,0x80]

@stevengj
Copy link
Member

There should also be a

@test is_valid_utf8(UInt8[])

Change parameters in utf8.c
@ScottPJones
Copy link
Contributor Author

OK, should be all set, and resquashed

@tkelman
Copy link
Contributor

tkelman commented May 14, 2015

Looks to be failing the new tests?

@ScottPJones
Copy link
Contributor Author

Yes, just saw that... don't know what's different from on my system... will try to get that fixed shortly

@ScottPJones
Copy link
Contributor Author

OK, I was playing with too many branches at the same time, and I didn't actually test those particular lines from @stevengj. After looking at it further, I realized that his tests simply are not correct, you can't mix and match incomplete UTF-8 sequences like that, because you can have two sequences that are invalid on their own, which put back together, are a valid sequence of bytes, which is esp. likely because of the way the code takes apart the bytes of the UTF-8 representation of a valid character, and then vcat's things together... in this test, sometimes two wrongs do make a right (which then causes the test to fail!) I'll have the tests in order shortly...

@ScottPJones
Copy link
Contributor Author

I’ve updated the test cases... I hope you’ll all agree that I did a fairly comprehensive set of test cases for UTF-8! Let me know if there is anything else to prevent this from being merged... Thanks!

stevengj added a commit that referenced this pull request May 15, 2015
Fix #11141/#10973 and improve performance of is_valid_utf8/is_valid_ascii
@stevengj stevengj merged commit 2f019d7 into JuliaLang:master May 15, 2015
@stevengj
Copy link
Member

Looks good, thanks!

@stevengj
Copy link
Member

Ah shoot, I just noticed you hadn't squashed the final commit; I probably should have waited for that before merging.

@ScottPJones ScottPJones deleted the spj/fixvalidutf8 branch May 15, 2015 14:41
@ScottPJones
Copy link
Contributor Author

Ah, I missed squashing that puppy... my other two that I think are totally ready to go (#11186 and #11241) are both squashed down to 1 commit. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants