Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Case folding fixes #133

Merged
merged 8 commits into from
May 2, 2018
Merged

Case folding fixes #133

merged 8 commits into from
May 2, 2018

Conversation

stevengj
Copy link
Member

@stevengj stevengj commented Apr 30, 2018

Updated version of #102:

  • Restores the original behavior of IGNORE so that this PR is non-breaking, adds new STRIPNA flag.
  • Renames the new function to utf8proc_NFKC_Casefold instead of utf8proc_NFKC_CF
  • Adds a minimal test.
  • Updates the utf8proc_data.c file.

To do:

  • Compare the result of UTF8PROC_CASEFOLD before and after this PR to make sure any changes are in the right direction. No differences found.

nomoon and others added 7 commits April 29, 2018 21:43
* Only include C (Common) and F (Full) foldings from CaseFolding.txt. Removed S (Simple) since F & S are specified to be exclusive.
* Extend UTF8PROC_IGNORE to also ignore unassigned codepoints (such as \u2065) which are specified as being discarded by NFKC_CF.
@stevengj
Copy link
Member Author

@nomoon, you wrote in #102 that unassigned codepoints are "specified as being discarded by NFKC_CF", but I can find no such specification.

In section 3.13 (Default Case Algorithms) of the Unicode specification, it says:

A modified form of Default Case Folding is designed for best behavior when doing caseless matching of strings interpreted as identifiers. This folding is based on Case_Folding(C), but also removes any characters which have the Unicode property value Default_Ignorable_Code_Point=True. It also maps characters to their NFKC equivalent sequences. Once the mapping for a string is complete, the resulting string is then normal- ized to NFC. That last normalization step simplifies the statement of the use of this folding for caseless matching.

According to section 5.21, it says:

The default ignorable code points are listed in DerivedCoreProperties.txt in the Unicode Character Database with the property Default_Ignorable_Code_Point.

and it doesn't seem like unassigned codepoints should be treated as ignorable.

Do you have any reference to the contrary? If not, I will remove the STRIPNA flag from NFKC_Casefold (but I will leave the flag in the API, since some people may want this transformation).

@nomoon
Copy link

nomoon commented Apr 30, 2018

@stevengj It's been a long time since I wrote the PR so I'm not sure where that came from. I'll look if I have the chance. In any event, it would be good to have the option available so as not to have to scrub the string of invalid codepoints as a separate step, since many use-cases of the NFKC_Casefold would a) assume that the string is valid, and b) possibly not properly case-fold if confused by invalid points.

@stevengj
Copy link
Member Author

(Note that unassigned != invalid.)

@nomoon
Copy link

nomoon commented Apr 30, 2018

@stevengj Of course. My bad. But yeah either way I can't find where I read that (possibly mis-read the ICU documentation or code).

@stevengj stevengj merged commit bdc8b9e into master May 2, 2018
@stevengj stevengj deleted the case_folding_fixes_new branch May 2, 2018 12:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants