Fix #34 handle 66 Unicode non-characters and surrogates correctly #35
Or possibly it's now testing non-characters because the definition changed. |
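For reference, the "66 non-characters" in the PR title are a fixed set defined by Unicode: U+FDD0..U+FDEF (32 code points) plus the last two code points of each of the 17 planes (34 code points). Surrogates are the separate range U+D800..U+DFFF. The sketch below spells out both definitions; the helper names are hypothetical and are not utf8proc functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; these are not utf8proc functions. */

/* Surrogates: U+D800..U+DFFF, reserved for UTF-16 pairs. */
static int is_surrogate(uint32_t cp) {
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Noncharacters: U+FDD0..U+FDEF, plus U+xxFFFE and U+xxFFFF in each
   of the 17 planes (plane 0 through plane 16). */
static int is_noncharacter(uint32_t cp) {
    if (cp > 0x10FFFF) return 0;
    if (cp >= 0xFDD0 && cp <= 0xFDEF) return 1;
    return (cp & 0xFFFE) == 0xFFFE;
}

/* Count noncharacters over the whole code space: 32 + 17 * 2. */
static int count_noncharacters(void) {
    int n = 0;
    for (uint32_t cp = 0; cp <= 0x10FFFF; cp++)
        n += is_noncharacter(cp);
    return n;
}
```

Noncharacters are still code points that UTF-8 can encode, whereas surrogates are not, so a test suite has to treat the two sets differently.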
The errors are something to do with graphemes, which I didn't think my changes would even affect. |
Given that both Travis builds failed (AppVeyor does not run the tests), it doesn't look like a random failure. It doesn't look unlikely to me that modifying an essential function like |
@nalimilan What makes me wonder a bit is all the warning messages in the Travis compilation...
Would also be good to add a regression test to the |
Note: I figured out why the grapheme test failed... but it doesn't make me very happy about utf8proc. |
Yes, the Alternatively, we could change it to insert NUL bytes as delimiters, but it would be good to change the API (e.g. to By test case, I mean creating an additional test program that checks for regressions on handling of these non-characters and surrogates. |
@@ -49,7 +51,7 @@ int main(int argc, char **argv)
         else
             i++;
     }
-    glen = utf8proc_map(utf8, j, &g, UTF8PROC_CHARBOUND);
+    glen = utf8proc_map(utf8, j, &g, UTF8PROC_CHARBOUND | UTF8PROC_OLDENCODE);
Shouldn't UTF8PROC_CHARBOUND imply UTF8PROC_OLDENCODE? Why have a separate flag?
I'd missed that (about CHARBOUND...).
OK, I'll try to get some testing in... (in my copious spare time! 😀)
Would it be possible to have extra testing as a follow-up PR?
In general, it's good practice to merge patches only given regression tests and (where needed) documentation (though none is needed in this case if there are no API changes).
Yes, true... I'll add some further testing, and make sure it gets all of the edge cases that were broken before... there are no API changes though.
}

/* internal "unsafe" version that does not check whether uc is in range */
static const utf8proc_property_t *utf8proc_unsafe_get_property(utf8proc_int32_t uc) {
Since this is a static function, there is no need to put the utf8proc_ prefix. It's good to preserve the convention that exported (non-static) functions get the prefix, whereas non-exported functions have no prefix.
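A minimal sketch of that naming convention, using a made-up library name ("mylib") and made-up functions so as not to imply anything about the real utf8proc sources: the static helper has no prefix and skips the range check, while the exported wrapper carries the prefix and validates its input first.

```c
#include <assert.h>
#include <stdint.h>

/* "mylib" and both functions are invented for this illustration. */

/* Non-exported helper: static, so no library prefix. It skips the range
   check and must only be called with 0 <= uc < 0x110000. */
static int unsafe_is_ascii(int32_t uc) {
    return uc < 0x80;
}

/* Exported function: carries the library prefix and validates its input
   before delegating to the unchecked helper. */
int mylib_is_ascii(int32_t uc) {
    if (uc < 0 || uc >= 0x110000) return 0;
    return unsafe_is_ascii(uc);
}
```

Because the helper is static, its name cannot clash with symbols in other translation units, which is why the prefix adds nothing there.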
I added those because I mistakenly thought that's what you were asking me to do... I'll go back and put those back... (this is what you get for programming on 3 days of only 2 hours of sleep!)
Once 39 is merged, can you update the tests to also test for noncharacter validity and include the updated tests in this PR? |
#39 already does test for noncharacter validity... If you run it now, it will show you all the problems you have if you don't merge this one 😀 |
@ScottPJones, then it should be merged into this PR (and squashed/rebased). The tests should be part of the PR on general principle, but also as a practical matter: unless the tests are part of the PR, we can't tell on github (until after everything is merged) that the PR actually passes the tests. |
…rformance and surrogate handling
Oh, you should also update the |
But then, isn't it harder to show issues that haven't been fixed yet? |
Oh, didn't know about .gitignore... Thanks! |
Committing tests that fail is not really compatible with the Travis/Github workflow. If you don't combine the PRs, then we are forced to either (a) merge 39 first, in which case you break the build (tests passing/failing is binary on Github) or (b) merge 35 first, in which case we are merging before testing. A more fine-grained testing framework might be nice, but that's not an issue we can take up here. |
I agree this would be a major improvement. There have been quite a few times where I've had to omit a test that should have been there simply because some other bug made it fail. It would be really nice to have a "todo list" of failing tests. That said, certainly our current framework doesn't support this. |
@stevengj The way the unit tests would work within the Travis/GitHub framework is fairly simple... A "failure" for Travis would only be when there is a regression on a test that passed previously...
What do you think? |
@ScottPJones, that's reasonable, but since you have to build this facility "manually" on github I wouldn't bother on a project as small as utf8proc. For Julia, it might be worthwhile to extend |
 size_t encode(char *dest, const char *buf)
 {
     size_t i = 0, j, d = 0;
-    do {
+    for (;;) {
BTW, is there some substantive reason you changed this? Don't most compilers generate exactly the same code for for (;;) {...} and do {...} while(1)?
Usually, it's best to avoid purely stylistic changes in patches (unless you have a patch that is purely for a global style change); on the contrary, one strives for a patch that matches the surrounding code style.
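To make the point concrete, here is a small sketch (not code from the patch) of the same loop written both ways. The two functions are semantically identical; whether the generated machine code is also identical depends on the compiler and optimization level, which is not verified here.

```c
#include <assert.h>
#include <stddef.h>

/* Length of a NUL-terminated string, written with for (;;). */
static size_t len_for(const char *s) {
    size_t n = 0;
    for (;;) {
        if (s[n] == '\0') break;
        n++;
    }
    return n;
}

/* The same loop written with do { ... } while (1). */
static size_t len_do(const char *s) {
    size_t n = 0;
    do {
        if (s[n] == '\0') break;
        n++;
    } while (1);
    return n;
}
```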
I don't even remember doing that... that sort of thing happens on autopilot... I date myself 😀, but when I started, not all C compilers optimized while(1) away!
Anyway, do I need to change that back and resquash?
@stevengj You're quite right, but maybe this should be raised as an issue? To improve unit testing for Julia? |
There are lots of nice small usable test frameworks out there that are better than what we have now - we could just pick one and start using it. I like the one that libgit2 uses for example: https://github.com/vmg/clar |
@tkelman I'm pretty ignorant of a lot of what's out there in OSS land... (too many years with "not invented here" mentality where I worked... unfortunately I wasn't able to change that...) |
dst[0] = 0xFE;
return 1;
// Note: we allow encoding 0xd800-0xdfff here, so as not to change
// the API, however, these are actually invalid in UTF-8
In that case, do we still need utf8proc_unsafe_encode_char? Oh, never mind, I see that it is still used for the CHARBOUND thing.
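The code comment under review says surrogates are still accepted by the encoder even though they are invalid in UTF-8. A rough sketch of that distinction follows; both functions are hypothetical stand-ins written for this illustration, not the utf8proc implementation. The permissive encoder emits bytes for any value up to 0x10FFFF, and the strict variant refuses U+D800..U+DFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; not the utf8proc implementation. */

/* Permissive encoder: writes 1-4 bytes for any cp <= 0x10FFFF,
   including surrogates. Returns bytes written, or 0 if out of range. */
static size_t permissive_encode(uint32_t cp, uint8_t *dst) {
    if (cp < 0x80) {
        dst[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        dst[0] = (uint8_t)(0xC0 | (cp >> 6));
        dst[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        dst[0] = (uint8_t)(0xE0 | (cp >> 12));
        dst[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        dst[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        dst[0] = (uint8_t)(0xF0 | (cp >> 18));
        dst[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        dst[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        dst[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Strict encoder: same output, but rejects surrogates, which have no
   well-formed UTF-8 representation. */
static size_t strict_encode(uint32_t cp, uint8_t *dst) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    return permissive_encode(cp, dst);
}
```

With the permissive version, U+D800 comes out as the three bytes ED A0 80, a sequence that a conforming UTF-8 decoder must reject.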
Not sure exactly what you mean, but the main thing a nicer test framework like clar (and there are even fancier ones out there, I'm sure) would do is run all the tests and give you a summary of failures at the end. |
I just started looking at clar, but (in my really quick 50,000-foot look), it didn't seem to have built-in tracking of previous results... but 1) I'm jetlagged, 2) need to look further...
Oh, I forgot to say, it compared with the last known good build along that branch... then if there are no regressions detected, the build is marked OK, and its results (if different) are saved and become the new point for future comparisons in the branch.
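The scheme described here could be sketched roughly as follows. This is only an illustration under assumed names and an assumed data layout (one pass/fail flag per test); it is not an existing Travis, GitHub, or clar feature. A build fails only if some test that passed in the baseline fails now, and a clean build becomes the new baseline.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed representation: one flag per test, in a fixed test order. */
enum { TEST_FAIL = 0, TEST_PASS = 1 };

/* Count regressions: tests that passed in the baseline but fail now.
   Tests that were already failing do not count. */
static size_t count_regressions(const int *baseline, const int *current,
                                size_t n) {
    size_t regressions = 0;
    for (size_t i = 0; i < n; i++)
        if (baseline[i] == TEST_PASS && current[i] == TEST_FAIL)
            regressions++;
    return regressions;
}

/* If there are no regressions, save the current results as the new
   baseline and return 1 (build OK). Otherwise leave the baseline
   untouched and return 0 (build failed). */
static int accept_build(int *baseline, const int *current, size_t n) {
    if (count_regressions(baseline, current, n) != 0)
        return 0;
    for (size_t i = 0; i < n; i++)
        baseline[i] = current[i];
    return 1;
}
```

Under this rule a known-failing test can be committed without breaking the build, and it starts counting against later builds only once it has passed.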
Fix #34 handle 66 Unicode non-characters and surrogates correctly
Thanks! |
This change also improves the performance of the functions utf8proc_iterate and utf8proc_codepoint_valid, and fixes surrogate handling in utf8proc_encode_char.
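The thread does not show how the validity check was made faster. One common technique, given here purely as an assumption about the kind of change involved, is to fold the surrogate test into a single unsigned comparison. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Straightforward form: in range and not a surrogate. */
static int valid_naive(int32_t uc) {
    if (uc < 0 || uc >= 0x110000) return 0;
    return !(uc >= 0xD800 && uc <= 0xDFFF);
}

/* Compact form: converting to unsigned turns negative inputs into huge
   values, and subtracting 0xD800 wraps everything below the surrogate
   range around to a large value, so one comparison excludes exactly
   U+D800..U+DFFF. */
static int valid_fast(int32_t uc) {
    uint32_t u = (uint32_t)uc;
    return (u - 0xD800u > 0x07FFu) && (u < 0x110000u);
}

/* Exhaustively compare both forms over the code space plus a margin
   on either side. Returns 1 if they agree everywhere. */
static int forms_agree(void) {
    for (int32_t uc = -0x1000; uc < 0x120000; uc++)
        if (valid_naive(uc) != valid_fast(uc)) return 0;
    return 1;
}
```

Note that both forms accept the 66 noncharacters and reject only surrogates and out-of-range values.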