Case folding fixes #133
* Only include C (Common) and F (Full) foldings from CaseFolding.txt. Removed S (Simple) since F & S are specified to be exclusive.
* Extend UTF8PROC_IGNORE to also ignore unassigned codepoints (such as \u2065) which are specified as being discarded by NFKC_CF.
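The C/F selection described in this commit can be illustrated with a short sketch. It is not the project's data generator: `SAMPLE` is a small hand-copied excerpt in the CaseFolding.txt format, and `parse_full_foldings` is a hypothetical helper name chosen for illustration.

```python
# Small excerpt in the CaseFolding.txt layout: code; status; mapping; # name
# (hand-copied for illustration; the real file is far longer).
SAMPLE = """\
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def parse_full_foldings(text, keep=("C", "F")):
    """Return {code point: [folded code points]} for the kept statuses.

    Hypothetical helper: keeps only the C (Common) and F (Full) rows,
    so S (Simple) and T (Turkic) rows never overwrite a full folding.
    """
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        if status in keep:
            table[int(code, 16)] = [int(cp, 16) for cp in mapping.split()]
    return table


FOLDINGS = parse_full_foldings(SAMPLE)
```

Because a code point with an F row may also have an S row (as U+1E9E does in the excerpt), filtering by status rather than taking the last row seen keeps the full mapping in the table.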
…_NFKC_Casefold, add a test
@nomoon, you wrote in #102 that unassigned codepoints are "specified as being discarded by NFKC_CF", but I can find no such specification. In section 3.13 (Default Case Algorithms) of the Unicode specification, it says:
Section 5.21 says:
and it doesn't seem like unassigned codepoints should be treated as ignorable. Do you have any reference to the contrary? If not, I will remove the …
@stevengj It's been a long time since I wrote the PR, so I'm not sure where that came from. I'll look if I have the chance. In any event, it would be good to have the option available so that the string doesn't have to be scrubbed of invalid codepoints as a separate step, since many use cases of NFKC_Casefold would a) assume that the string is valid, and b) possibly not case-fold properly if confused by invalid codepoints.
(Note that unassigned != invalid.)
@stevengj Of course. My bad. But yeah, either way I can't find where I read that (I possibly misread the ICU documentation or code).
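To make the unassigned/invalid distinction from this exchange concrete, here is a minimal sketch using Python's standard `unicodedata` module rather than utf8proc itself. The names `classify` and `strip_unassigned` are hypothetical, and treating surrogates and out-of-range values as "invalid" is an assumption about what "invalid" means here. Note that the `Cn` category also covers noncharacters, and that the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata


def classify(cp):
    """Classify an integer code point (hypothetical helper)."""
    # Assumption: "invalid" means outside the Unicode range or a surrogate.
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return "invalid"
    # General category Cn: no character assigned in the bundled Unicode data.
    if unicodedata.category(chr(cp)) == "Cn":
        return "unassigned"
    return "assigned"


def strip_unassigned(s):
    """Drop unassigned code points, leaving everything else untouched."""
    return "".join(ch for ch in s if classify(ord(ch)) != "unassigned")
```

A caller that wants the stripping behaviour would apply `strip_unassigned` before normalizing or case-folding; leaving it as a separate opt-in step keeps the default behaviour unchanged.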
Updated version of #102:
* … IGNORE so that this PR is non-breaking; adds new STRIPNA flag.
* … utf8proc_NFKC_Casefold instead of utf8proc_NFKC_CF.
* … utf8proc_data.c file.

To do:
* … UTF8PROC_CASEFOLD before and after this PR to make sure any changes are in the right direction. No differences found.
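A before/after comparison of this kind can be sketched as a scan over code points that records every place two folding functions disagree. This is not the check that was actually run for this PR: `diff_foldings` is a hypothetical name, and `str.lower` and `str.casefold` are used only as stand-ins for the "before" and "after" tables.

```python
import sys


def diff_foldings(fold_before, fold_after, code_points=None):
    """Return (code point, before, after) wherever the two foldings differ."""
    if code_points is None:
        code_points = range(sys.maxunicode + 1)
    changes = []
    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF:  # skip surrogates, which are not characters
            continue
        ch = chr(cp)
        before, after = fold_before(ch), fold_after(ch)
        if before != after:
            changes.append((cp, before, after))
    return changes


# Stand-in comparison restricted to Latin-1 to keep the example fast.
SAMPLE_CHANGES = diff_foldings(str.lower, str.casefold, range(0x100))
```

An empty result corresponds to the "No differences found" outcome; a non-empty list gives the exact code points to review by hand.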