Case folding fixes #133

stevengj · 2018-04-30T02:24:37Z

Updated version of #102:

* Restores the original behavior of IGNORE so that this PR is non-breaking, and adds a new STRIPNA flag.
* Renames the new function to utf8proc_NFKC_Casefold instead of utf8proc_NFKC_CF.
* Adds a minimal test.
* Updates the utf8proc_data.c file.

To do:

Compare the result of UTF8PROC_CASEFOLD before and after this PR to make sure any changes are in the right direction. No differences found.

* Only include C (Common) and F (Full) foldings from CaseFolding.txt. Removed S (Simple) since F & S are specified to be exclusive.
* Extend UTF8PROC_IGNORE to also ignore unassigned codepoints (such as \u2065) which are specified as being discarded by NFKC_CF.
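The "C and F only" rule can be illustrated with a small filter over CaseFolding.txt data lines, which have the form `code; status; mapping; # name`. This is only a sketch: `keep_folding_line` is a hypothetical helper written for illustration, not the code utf8proc uses to generate its data tables.

```c
#include <stdio.h>

/* Hypothetical helper, not part of utf8proc.
 * Returns 1 if a CaseFolding.txt line carries status C (Common) or F (Full),
 * and 0 for comments, blank lines, and S (Simple) or T (Turkic) entries. */
static int keep_folding_line(const char *line) {
    unsigned int code;
    char status;

    /* Skip comment lines and empty lines. */
    if (line[0] == '#' || line[0] == '\n' || line[0] == '\0')
        return 0;

    /* Expect "<hex code>; <status>; ..." at the start of the line. */
    if (sscanf(line, "%x; %c;", &code, &status) != 2)
        return 0;

    return status == 'C' || status == 'F';
}
```

With this filter, U+1E9E keeps its full folding to `0073 0073` and drops its simple folding to `00DF`, so each code point ends up with at most one mapping.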


stevengj · 2018-04-30T02:35:48Z

@nomoon, you wrote in #102 that unassigned codepoints are "specified as being discarded by NFKC_CF", but I can find no such specification.

Section 3.13 (Default Case Algorithms) of the Unicode specification says:

A modified form of Default Case Folding is designed for best behavior when doing caseless matching of strings interpreted as identifiers. This folding is based on Case_Folding(C), but also removes any characters which have the Unicode property value Default_Ignorable_Code_Point=True. It also maps characters to their NFKC equivalent sequences. Once the mapping for a string is complete, the resulting string is then normalized to NFC. That last normalization step simplifies the statement of the use of this folding for caseless matching.

Section 5.21 says:

The default ignorable code points are listed in DerivedCoreProperties.txt in the Unicode Character Database with the property Default_Ignorable_Code_Point.

and it doesn't seem like unassigned codepoints should be treated as ignorable.

Do you have any reference to the contrary? If not, I will remove the STRIPNA flag from NFKC_Casefold (but I will leave the flag in the API, since some people may want this transformation).

nomoon · 2018-04-30T17:53:16Z

@stevengj It's been a long time since I wrote the PR, so I'm not sure where that came from. I'll look if I have the chance. In any event, it would be good to have the option available so as not to have to scrub the string of invalid codepoints as a separate step, since many use cases of NFKC_Casefold would a) assume that the string is valid, and b) possibly not properly case-fold if confused by invalid points.

stevengj · 2018-04-30T18:31:27Z

(Note that unassigned != invalid.)
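To make the distinction concrete: validity can be decided from the code point's numeric value alone, while "unassigned" depends on which characters a given Unicode version defines and needs a table lookup. The check below is a minimal sketch, and `is_valid_scalar` is an illustrative name, not a utf8proc function.

```c
#include <stdint.h>

/* Illustrative helper, not part of utf8proc.
 * Returns 1 if cp is a Unicode scalar value: within U+0000..U+10FFFF and
 * outside the surrogate range U+D800..U+DFFF. It says nothing about whether
 * the code point is assigned to a character. */
static int is_valid_scalar(uint32_t cp) {
    if (cp > 0x10FFFF)
        return 0;              /* beyond the Unicode code space */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;              /* surrogates are not scalar values */
    return 1;
}
```

U+2065, the example from the PR description, passes this check. Whether it is assigned is a separate question that only the Unicode Character Database can answer.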

nomoon · 2018-04-30T21:49:58Z

@stevengj Of course. My bad. But yeah either way I can't find where I read that (possibly mis-read the ICU documentation or code).

nomoon and others added 7 commits April 29, 2018 21:43

Document the changes to UTF8PROC_IGNORE in header.

8e317d0

Add NFKC_CF helper function with documentation.

a96c9b4

restore old IGNORE behavior, add UTF8PROC_STRIPNA, rename to utf8proc_NFKC_Casefold, add a test

edff036

success message

f27c5a9

test that IGNORE does not strip NA

144f90d

data update

56f3b1b

stevengj mentioned this pull request Apr 30, 2018

Fixes allowing for “Full” folding and NFKC_CaseFold compliance. #102

Closed

NFKC_Casefold shouldn't strip NA

da72c45

stevengj merged commit bdc8b9e into master May 2, 2018

stevengj deleted the case_folding_fixes_new branch May 2, 2018 12:15
