New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simple case folding #8120
Comments
I started. Alfa code https://github.com/iSazonov/PowerShell/tree/add-unicode3/src/System.Management.Automation/utils/unicode Better result is 10x win in best case and -4x in worst case. It is worth exploring further! @mklement0 Have you an experience with Unicode? I'll need help to create a test set and review API design. |
(Somewhat ironically, the Thanks for looking into this, looks promising. I think I have a basic understanding of Unicode fundamentals, but no experience relating to implementation aspects. Can you elaborate on how I could help? Not sure what you mean by test site. As for implementing the Unicode-version-dependent case-folding tables ourselves, please note the following comments from https://github.com/dotnet/corefx/issues/17233:
So it sounds like we should take advantage of OS support where available (ICU on Linux; ?? on macOS). |
Test set :-) It is Friday evening. If we would develop the APIs we should create a full set of tests. (We could look and grab CoreFX/CoreCLR tests - possible start search point dotnet/corefx#32968).
http://unicode.org/ contains standard papers. I could share more tomorrow.
Inital plan is:
Ideal plan is to use the APIs in PowerShell engine for identifers. A perfect plan :-) is to get full featured API in CoreFX/CoreCLR. |
Today I have done some performance tests and it seems it will not be easy to overtake Kernel32 p/invokes. Though I have no experience with low level optimizations at all. |
From: http://unicode.org/faq/casemap_charprop.html
From http://www.unicode.org/reports/tr31/tr31-29.html#R5
From http://www.unicode.org/reports/tr21/tr21-5.html
From http://unicode.org/reports/tr18/#Folded_Matching
From http://unicode.org/reports/tr18/#Simple_Loose_Matches
From http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf
A modified form of Default Case Folding is designed for best behavior when doing caseless
Not related but interesting http://www.unicode.org/notes/tn31/tn31-2.html |
Additianal link from Rust RegEx library repository https://github.com/rust-lang/regex/blob/77140e7c1628ac31026d2421c6a4c3b0eb19506c/UNICODE.md#rl15-simple-loose-matches |
First alfa code performance results (yes, I got the code fast!): BenchmarkDotNet=v0.11.1, OS=Windows 10.0.10240.17146 (1507/RTM/Threshold1)
|
Given how frequently PS uses case-insensitive comparisons, that will likely make a massive difference if it can be utilised widely in the PS code base. Looking good @iSazonov!! |
Yes, it is encouraging. I updated link above to the alfa code. Now I started to work on fold comparison methods and perf tests. |
This can not be trusted but the first results are impressive. BenchmarkDotNet=v0.11.1, OS=Windows 10.0.10240.17146 (1507/RTM/Threshold1)
|
I updated my work branch. Now the code is placed in correct (I hope :-) ) folders and you can compile PowerShell and run xUnit and performance tests. Next step is to create a comparer based on simple folded strings. Then we could try it in PowerShell engine. |
Great stuff, @iSazonov; thanks also for the helpful collection of quotes from the Unicode standard. I've never had to deal with Unicode normalization problems in the context of .NET (e.g., that accented letter |
There is case mapping and case folding. Unicode standard exposes CaseFolding.txt file containing a table for case folding (full and simple in one table). We parse the file (we created a PowerShell script for this) and one-time generate C# file with two static char arrays - first for 2-byte code points and second for 4-byte code points - this is important because chars in C# is Utf16 2-byte values, it also means that we have to process surrogates. These array is used for one-step very fast char conversions. Surrogates require additional calculations but surrogates is very rare and the calculations is very simple - this should not degrade performance. |
Thanks, @iSazonov. While the normalization problem I've mentioned is unrelated to casing per se, I still wonder if it affects us here: # Composed normalization form: 'ö' as a *single* code point.
$oUmlautNFC = [string] [char] 0xf6
# Decomposed normalization form: 'ö' as *two* code points:
# The base character, 'o', followed by a combining diacritic.
$oUmlautNFD = -join [char[]] (0x6f, 0x308)
# String comparison *does* recognize the equivalence, even though
# the strings have distinct code points:
$oUmlautNFC -eq $oUmlautNFD # -> $True! Does your code recognize this kind of equivalence too? |
I would expect that it doesn't need to, @mklement0. In that example, let's consider how case-folding would treat the characters: I would expect that folding the single code-point variant would consider both In the case of the regular character and the combining diacritic, it only needs to fold case for the regular character, and the combining diacritic would be unaffected. |
@vexx32: That applies if both input strings are in the same normalization form (and only differ by case), but the question is what happens if they're not: With # NFC 'Ö' (LATIN CAPITAL LETTER O WITH DIAERESIS)
$oUmlautNFC_Upper = [string] [char] 0xd6
# Compare uppercase (single-code-point) NFC Ö to lowercase (2-code-point) NFD ö
$oUmlautNFC_Upper -eq $oUmlautNFD # -> !True! By contrast - much to my surprise - $oUmlautNFC_Upper | Select-String $oUmlautNFD # !! NO OUTPUT Is it by design that Either way, while the answer may be that we cannot handle normalization differences for performance reasons, we need to have clarity on that decision and to document it accordingly. |
@mklement0 Your question is open question. I think the question is "What is PowerShell identifier" in Unicode terms and our answer should be based on http://www.unicode.org/reports/tr31/
|
There's a chapter on normalization in the document you link to: http://www.unicode.org/reports/tr31/#normalization_and_case Have only glanced at it at this point, but this caught my eye:
So it sounds like disallowing variants (making no attempt to recognize different normalization forms as equivalent) is an option - again, a decision we should be clear about and document. |
Yes, the standard describes all common possibilities. And it says about normalization of identifiers. This allows "human" identifiers. I believe it make sense for DSL or AI and make no sense for strong typed languages like C# and perhaps for PowerShell too - I'd very wonder if PowerShell consider "$Ёлка" and "$Елка" as equal although both are Christmas tree (the sample is only "display difference"! not Unicode specific). |
I agree that for identifiers disallowing normalization variants makes sense. (As an aside: your
That explains why |
In Russian we can freely replace Ё - 0401 with Е - 0415. (For lower case too). But "Ёлка" -eq "Елка" returns false - I don't know why. Perhaps because of Unicode backword compatibility and stability only. I don't wonder about I've been going to do this for a long time #8180 for |
Thanks for the background on
It is unfortunate that The difference rarely matters, but there is the infamous Turkish "I" problem in the [The following is at least partially incorrect.]
|
Seems the Turkish case mapping is not related with case folding. I believe simple case folding works with Turkish well because the folding is intended to be language-neutral. This is just the reason why we can get a win in regex and PowerShell identifier comparisons internally and can not in -match / -replace / split. |
It is part of the case-folding table, but of the language-specific variety: status
To reiterate: case folding can be language-neutral or language-specific (tailored), as needed. But I agree that for identifiers it definitely makes sense to stay language-neutral.
Understood re identifiers, but what regex used do you mean in this case, given that regexes are also used for |
|
Today I measured |
I opened new issue in CoreFXlab repo to start a work to add the enhancement in .Net Core. |
@SteveL-MSFT I am ready to create a nuget package in CoreFXlab dotnet/corefxlab#2610 |
@iSazonov I think we can use it with 6.3-preview.1 |
@SteveL-MSFT Thanks! Currently the code was merged in Corefxlab but nuget package is not still created. Targeting is to .Net Core 3.0 only. Now there need to review public api and to make performance optimizations. This is a work for MSFT experts. It would be nice if you could sync internally with Core team. |
Moving out of 7.0 as it seems .NET 5 may support this |
Yes, they plan to move ICU internally in Core and unify Core behavior (to fall back to OS with p/invoke sometimes). In the case OrdinalIgnoreCase will implement the SCF, I hope with great performance. While waiting Core 5.0 we could resolve #8180 without risk to conflict with Core. |
From http://unicode.org/faq/casemap_charprop.html :
I opened https://github.com/dotnet/corefx/issues/33047 to implement simple case folding.
We could use this for improving performance in:
The text was updated successfully, but these errors were encountered: