
Fix #34 handle 66 Unicode non-characters and surrogates correctly #35

Merged: 2 commits, May 30, 2015

Conversation

ScottPJones (Contributor)

This change also improves the performance of the functions utf8proc_iterate and utf8proc_codepoint_valid, and fixes surrogate handling in utf8proc_encode_char.
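
For context, the rule being enforced is roughly the following (a minimal sketch of the check, not the exact utf8proc code, and the helper name is made up): surrogates U+D800–U+DFFF are never valid scalar values, while the 66 noncharacters (U+FDD0–U+FDEF and the U+xxFFFE/U+xxFFFF pairs) remain valid codepoints.

    #include <stdint.h>

    /* Sketch only: a code point is a valid Unicode scalar value iff it is
     * in the range 0..0x10FFFF and is not a UTF-16 surrogate.  The 66
     * noncharacters are deliberately treated as valid. */
    static int is_valid_scalar(int32_t uc)
    {
        return uc >= 0 && uc < 0x110000 && !(uc >= 0xD800 && uc <= 0xDFFF);
    }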

@pao commented May 9, 2015

make check failed on Travis, possible encoding problem? (Fancy that in a Unicode library.)

@pao commented May 9, 2015

Or possibly it's now testing non-characters because the definition changed.

@ScottPJones (Contributor Author)

The errors have something to do with graphemes, which I didn't think my changes would even affect.
There are also a number of warnings about a wcwidth function, on both gcc and clang, in code that I didn't touch 😞

@nalimilan (Member)

Given that both Travis builds failed (AppVeyor does not run the tests), it doesn't look like a random failure. It seems quite plausible to me that modifying an essential function like utf8proc_encode_char could change the behaviour of the tests. Even if you didn't change the tested functions themselves, they probably call the modified code indirectly.

@ScottPJones (Contributor Author)

@nalimilan What makes me wonder a bit is all the warning messages in the Travis compilation...
I'll dig into it, though.

@stevengj (Member)

Would also be good to add a regression test to the tests/ directory.

@ScottPJones (Contributor Author)

Note: I figured out why the grapheme test failed... but it doesn't make me very happy about utf8proc.
I managed to work around it, but basically, the tests depended on the UTF-8 encoding being broken!
(It encoded some values as 0xFF and 0xFE, which it then used as markers, but left that bad encoding
in place for all callers of the utf8proc_encode_char function.)
@stevengj This already has fairly complete tests... are you talking about additional tests in the Julia unit tests?

@stevengj (Member)

Yes, the CHARBOUND option was documented as inserting 0xFF bytes as delimiters. I'd be happy to deprecate this functionality, since I find utf8proc_grapheme_break to be much more useful (because it doesn't require making a copy of the string etcetera just to detect grapheme breaks).

Alternatively, we could change it to insert NUL bytes as delimiters, but it would be good to change the API (e.g. to UTF8PROC_GRAPHEME) in that case to force a breaking change for code relying on this feature.

By test case, I mean creating an additional test program that checks for regressions on handling of these non-characters and surrogates.
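
For illustration (a rough sketch, not code from this PR, assuming the documented signatures of utf8proc_iterate and utf8proc_grapheme_break and the header's type names), grapheme boundaries can be found without copying the string at all:

    #include <stdio.h>
    #include "utf8proc.h"

    /* Print the byte offset of each grapheme-cluster boundary in a
     * UTF-8 string, without allocating a transformed copy of it. */
    static void print_breaks(const utf8proc_uint8_t *str, utf8proc_ssize_t len)
    {
        utf8proc_int32_t prev = -1, cp;
        utf8proc_ssize_t i = 0;
        while (i < len) {
            utf8proc_ssize_t n = utf8proc_iterate(str + i, len - i, &cp);
            if (n < 0) break;                    /* invalid UTF-8, stop */
            if (prev >= 0 && utf8proc_grapheme_break(prev, cp))
                printf("grapheme break before byte %ld\n", (long)i);
            prev = cp;
            i += n;
        }
    }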

@@ -49,7 +51,7 @@ int main(int argc, char **argv)
     else
       i++;
   }
-  glen = utf8proc_map(utf8, j, &g, UTF8PROC_CHARBOUND);
+  glen = utf8proc_map(utf8, j, &g, UTF8PROC_CHARBOUND | UTF8PROC_OLDENCODE);
Member

Shouldn't UTF8PROC_CHARBOUND imply UTF8PROC_OLDENCODE? Why have a separate flag?

Contributor Author

I'd missed that (about CHARBOUND...).
OK, I'll try to get some testing in... (in my copious spare time! 😀)
Would it be possible to have extra testing as a follow-up PR?

Member

In general, it's good practice to merge patches only given regression tests and (where needed) documentation (though none is needed in this case if there are no API changes).

Contributor Author

Yes, true... I'll add some further testing and make sure it covers all of the edge cases that were broken before... there are no API changes, though.

@ScottPJones (Contributor Author)

@stevengj Could you also please take a look at my new PR #39 here?
I added some reasonably extensive testing of both the codepoint_valid and iterate functions...
I did this as a separate PR, so you could easily see the issues I found with the current utf8proc.c code...
Thanks greatly!

}

/* internal "unsafe" version that does not check whether uc is in range */
static const utf8proc_property_t *utf8proc_unsafe_get_property(utf8proc_int32_t uc) {
Member

Since this is a static function, there is no need to put the utf8proc_ prefix. It's good to preserve the convention that exported (non-static) functions get the prefix, whereas non-exported functions have no prefix.

Contributor Author

I added those prefixes because I mistakenly thought that's what you were asking me to do... I'll go back and take them back out... (this is what you get for programming on 3 days of only 2 hours of sleep!)

@stevengj (Member)

Once #39 is merged, can you update the tests to also test for noncharacter validity and include the updated tests in this PR?

@ScottPJones (Contributor Author)

#39 already does test for noncharacter validity... If you run it now, it will show you all the problems you have if you don't merge this one 😀

@stevengj (Member)

@ScottPJones, then it should be merged into this PR (and squashed/rebased).

The tests should be part of the PR on general principle, but also as a practical matter: unless the tests are part of the PR, we can't tell on GitHub (until after everything is merged) whether the PR actually passes the tests.

@stevengj (Member)

Oh, you should also update the .gitignore file with the names of the new test programs (so that git status will ignore them).

@ScottPJones (Contributor Author)

But then, isn't it harder to show issues that haven't been fixed yet?
Actually, that's something that I think is a problem with the way the current unit testing works...
I think you should be able to add unit tests for any problems that have been found, independently of fixes... and the unit test framework should check for regressions on each particular test, saving and displaying information about the ones that fail instead of terminating the testing process...
(I worked this way for many years; it was very effective in not losing track of issues in the product, while showing where progress was being made in cleaning up old issues.)

@ScottPJones (Contributor Author)

Oh, didn't know about .gitignore... Thanks!

@stevengj (Member)

Committing tests that fail is not really compatible with the Travis/GitHub workflow. If you don't combine the PRs, then we are forced to either (a) merge #39 first, in which case you break the build (tests passing/failing is binary on GitHub), or (b) merge #35 first, in which case we are merging before testing.

A more fine-grained testing framework might be nice, but that's not an issue we can take up here.

@timholy commented May 29, 2015

I think you should be able to add unit tests for any problems that have been found, independently of fixes.

I agree this would be a major improvement. There have been quite a few times where I've had to omit a test that should have been there simply because some other bug made it fail. It would be really nice to have a "todo list" of failing tests.

That said, certainly our current framework doesn't support this.

@ScottPJones (Contributor Author)

@stevengj The way the unit tests would work within the Travis/GitHub framework is fairly simple... A "failure" for Travis would only be when there is a regression on a test that passed previously...
New tests show what the behavior is supposed to be, and the ones that don't match get counted up,
and so you'd see something like:

Unicode Tests:
     Line 30, 'isvalid(UTF8String, b"\xff") -> false' failed, returned true
     .... 
  1000 passed, 50 failed, 3.53 seconds

What do you think?
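
(Something like this hypothetical non-aborting check macro, purely a sketch and not tied to any existing framework, would be enough to produce that kind of summary from the C tests:)

    #include <stdio.h>

    /* Hypothetical: count results instead of aborting on the first failure. */
    static int n_pass = 0, n_fail = 0;

    #define CHECK(cond) do { \
        if (cond) n_pass++; \
        else { n_fail++; fprintf(stderr, "FAIL %s:%d: %s\n", __FILE__, __LINE__, #cond); } \
    } while (0)

    static void summary(void) { printf("%d passed, %d failed\n", n_pass, n_fail); }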

@stevengj (Member)

@ScottPJones, that's reasonable, but since you would have to build this facility "manually" on GitHub, I wouldn't bother on a project as small as utf8proc. For Julia, it might be worthwhile to extend Base.Test with a @todo test (for something that is expected to fail) or similar.

@ScottPJones (Contributor Author)

@stevengj Thanks for all the review! Hope I got the move of #39 here done correctly...

size_t encode(char *dest, const char *buf)
{
  size_t i = 0, j, d = 0;
-  do {
+  for (;;) {
Member

BTW, is there some substantive reason you changed this? Don't most compilers generate exactly the same code for for (;;) {...} and do {...} while(1)?

Usually, it's best to avoid purely stylistic changes in patches (unless you have a patch that is purely for a global style change); on the contrary, one strives for a patch that matches the surrounding code style.

Contributor Author

I don't even remember doing that... that sort of thing happens on autopilot... I'm dating myself 😀, but when I started, not all C compilers optimized the while(1) test away!

Contributor Author

Anyway, do I need to change that back and resquash?

@ScottPJones (Contributor Author)

@stevengj You're quite right, but maybe this should be raised as an issue, to improve unit testing for Julia?
We actually had a system that kept track of the results of previous unit tests in CSV files... so it wasn't really "manual". But 1) I didn't build the unit testing framework, and 2) it's not the sort of programming I really want to do 😀. I know it can be done and is very, very useful, though, and it would be good if somebody smart could implement it... (sort of an intermediate beginner project? or Summer of Code?)

@tkelman (Contributor) commented May 29, 2015

There are lots of nice small usable test frameworks out there that are better than what we have now - we could just pick one and start using it. I like the one that libgit2 uses for example: https://github.com/vmg/clar

@ScottPJones (Contributor Author)

@tkelman I'm pretty ignorant of a lot of what's out there in OSS land... (too many years with "not invented here" mentality where I worked... unfortunately I wasn't able to change that...)
Does vmg/clar have the capabilities that I described, of checking for regressions on each individual test?

dst[0] = 0xFE;
return 1;
// Note: we allow encoding 0xd800-0xdfff here, so as not to change
// the API, however, these are actually invalid in UTF-8
Member

In that case, do we still need utf8proc_unsafe_encode_char? Oh, never mind, I see that it is still used for the CHARBOUND thing.
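
(As a side note, a sketch of why those values are invalid in UTF-8, not the utf8proc code itself: applying the generic 3-byte UTF-8 pattern to a surrogate such as U+D800 yields the bytes 0xED 0xA0 0x80, and RFC 3629 only permits 0x80–0x9F after a leading 0xED, so a strict decoder must reject exactly what this path can emit.)

    /* Illustration only: the generic 3-byte UTF-8 pattern for code points
     * in 0x0800..0xFFFF.  For uc = 0xD800 this produces 0xED 0xA0 0x80,
     * which matches the pattern but is forbidden by RFC 3629. */
    static int encode3(unsigned char *dst, int uc)
    {
        dst[0] = (unsigned char)(0xE0 | (uc >> 12));
        dst[1] = (unsigned char)(0x80 | ((uc >> 6) & 0x3F));
        dst[2] = (unsigned char)(0x80 | (uc & 0x3F));
        return 3;
    }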

@tkelman (Contributor) commented May 29, 2015

checking for regressions on each individual test?

Not sure exactly what you mean, but the main thing a nicer test framework like clar (and there are even fancier ones out there, I'm sure) would do is run all the tests and give you a summary of failures at the end.

@ScottPJones (Contributor Author)

I just started looking at clar, but (in my really quick 50,000-foot look) it didn't seem to have built-in tracking of previous results... but 1) I'm jetlagged, 2) I need to look further...
The way the InterSystems test suite worked (IIRC) is that it would tag a potential release point (for the nightly build, for example) using Perforce, run the tests, and save the results of all of them
(it also had files with the correct results). It would then compare each item (checking whether tests had been added or deleted) to see if the result changed... if it changed and didn't match the correct result, that was deemed a regression, and the build would be considered failed.
You'd get a nice report with how many new tests passed, how many new tests failed (along with more information to help debug each individual one), how many old tests passed, how many failed without changing their result, and how many changed behavior and failed... you could also run the test suite locally, or just subsections of it...

@ScottPJones (Contributor Author)

Oh, I forgot to say, it compared with the last known good build along that branch... then, if no regressions are detected, the build is marked OK, and its results (if different) are saved and become the new point for future comparisons in the branch.

stevengj added a commit that referenced this pull request May 30, 2015
Fix #34 handle 66 Unicode non-characters and surrogates correctly
@stevengj merged commit 35ec8e3 into JuliaStrings:master May 30, 2015
@ScottPJones (Contributor Author)

Thanks!

@ScottPJones deleted the spj/valid branch May 30, 2015 08:25
tkelman added a commit that referenced this pull request May 30, 2015
#35 and #40 added new tests that #38 did not take into account

this is one case where it would be good if Travis re-tested the PR
after new commits get pushed to master