Large hiragana and small hiragana in Japanese are treated as the same letter. #20786

Closed
5 tasks done
AWtnb opened this issue Nov 27, 2023 · 9 comments

Comments

@AWtnb

AWtnb commented Nov 27, 2023

Prerequisites

Steps to reproduce

"あ" -eq "ぁ"
"い" -eq "ぃ"
"う" -eq "ぅ"
"え" -eq "ぇ"
"お" -eq "ぉ"
"つ" -eq "っ"
"や" -eq "ゃ"
"ゆ" -eq "ゅ"
"よ" -eq "ょ"
"わ" -eq "ゎ"

Expected behavior

PS> 
>> $a1 = "あいうえおつやゆよわ"
>> $a2 = "ぁぃぅぇぉっゃゅょゎ"
>> $i = 0
>> $a1.GetEnumerator()|% {
>>   ($_ -as [string]) -eq ($a2[$i] -as [string])
>>   $i++
>> }
>> False
>> False
>> False
>> False
>> False
>> False
>> False
>> False
>> False

Actual behavior

PS> 
>> $a1 = "あいうえおつやゆよわ"
>> $a2 = "ぁぃぅぇぉっゃゅょゎ"
>> $i = 0
>> $a1.GetEnumerator()|% {
>>   ($_ -as [string]) -eq ($a2[$i] -as [string])
>>   $i++
>> }
>> True
>> True
>> True
>> True
>> True
>> True
>> True
>> True
>> True

Error details

No response

Environment data

PS> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      7.4.0
PSEdition                      Core
GitCommitId                    7.4.0
OS                             Microsoft Windows 10.0.22621
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

Visuals

image

@AWtnb AWtnb added the Needs-Triage The issue is new and needs to be triaged by a work group. label Nov 27, 2023
@AWtnb
Author

AWtnb commented Nov 27, 2023

In Japanese hiragana, there are two types of letters: the larger ones, あいうえおつやゆよわ, and the smaller ones, ぁぃぅぇぉっゃゅょゎ.

These are different characters and are pronounced differently.

Example:
(Phonetic notations are from https://en.wiktionary.org )

  • ふあん [fùáń] (不安 = anxiety) / ふぁん [fáꜜǹ]
  • しや [shíꜜyà] (視野 = visible area) / しゃ [sháꜜ]
  • きよう [kíꜜyòò] (器用 = skilful) / きょう [kyō]

In Windows PowerShell 5.1, these were correctly treated as different letters.

image

But they are treated as the same letter in PowerShell (Core) (I first noticed this in PowerShell 7.1.0).

PS> "あ" -eq "ぁ"
True
PS> "あ" -ceq "ぁ"
False

あ and ぁ, い and ぃ, ... are all different letters.
So both "あ" -eq "ぁ" and "あ" -ceq "ぁ" should be False (unlike the relationship between uppercase and lowercase of the alphabet in English).

This phenomenon also occurs when unicode codepoints are used to specify characters.

PS> "`u{3041}"
ぁ
PS> "`u{3042}"
あ
PS> "`u{3041}" -eq "`u{3042}"
True
PS> "`u{3041}" -ceq "`u{3042}"
False

On the other hand, katakana letters seem to be treated correctly:

image

Unicode Table

Letter  Codepoint
ぁ      U+3041
あ      U+3042
ぃ      U+3043
い      U+3044
ぅ      U+3045
う      U+3046
ぇ      U+3047
え      U+3048
ぉ      U+3049
お      U+304a
っ      U+3063
つ      U+3064
ゃ      U+3083
や      U+3084
ゅ      U+3085
ゆ      U+3086
ょ      U+3087
よ      U+3088
ゎ      U+308e
わ      U+308f
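
As a rough sketch, the pairs can also be listed programmatically. The variable names below are only illustrative. The comparison uses [System.StringComparison]::Ordinal, which compares code units directly and does not depend on the globalization library in use.

```powershell
# The ten large hiragana and their small counterparts, in matching order.
$large = 'あいうえおつやゆよわ'
$small = 'ぁぃぅぇぉっゃゅょゎ'

$pairs = for ($i = 0; $i -lt $large.Length; $i++) {
    [pscustomobject]@{
        Large        = $large[$i]
        LargeCode    = 'U+{0:X4}' -f [int]$large[$i]
        Small        = $small[$i]
        SmallCode    = 'U+{0:X4}' -f [int]$small[$i]
        # Ordinal comparison: equal only if the code units are identical.
        OrdinalEqual = [string]::Equals([string]$large[$i], [string]$small[$i], [System.StringComparison]::Ordinal)
    }
}

$pairs | Format-Table
```

Each small letter sits exactly one code point below its large counterpart, so an ordinal comparison always tells them apart.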

@mklement0
Contributor

mklement0 commented Nov 27, 2023

PowerShell is just the messenger here; I think this comes down to a change in localization libraries that was introduced in .NET 5 (at the time of PowerShell 7.1), namely the move from NLS to ICU.

In many contexts, PowerShell uses the invariant culture for string operations, which means that:

"あ" -eq "ぁ"

is effectively the same as:

[string]::Equals("あ", "ぁ", 'InvariantCultureIgnoreCase')

And in .NET 5+ / PowerShell 7.1+, this now yields $true.

The linked help topic discusses an opt-in to the old (.NET Framework) behavior based on the Windows-only NLS APIs, but note that in PowerShell it will invariably apply session-wide.
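
To check which library the current session is using, a heuristic along the following lines may help. It is a sketch written from memory of an approach described in the .NET globalization documentation, so treat it as an assumption and not as an official API. The variable name $usingIcu is only illustrative.

```powershell
# Heuristic: under ICU, the first four bytes of SortVersion.SortId are
# expected to encode the same number as SortVersion.FullVersion.
$sortVersion = [System.Globalization.CultureInfo]::InvariantCulture.CompareInfo.Version
$bytes = $sortVersion.SortId.ToByteArray()

# Reassemble the first four bytes as a little-endian 32-bit integer.
$version = ([int]$bytes[3] -shl 24) -bor ([int]$bytes[2] -shl 16) -bor ([int]$bytes[1] -shl 8) -bor [int]$bytes[0]

$usingIcu = ($version -ne 0) -and ($version -eq $sortVersion.FullVersion)
"Using ICU: $usingIcu"
```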


Note that [char] instances compare differently:

# -> $false in .NET 5+ / PowerShell 7.1+ too
[char] 'あ' -eq [char] 'ぁ'

The reason is that with -eq (same as: -ieq) [char]::ToUpperInvariant() is called on both operands, and
[char]::ToUpperInvariant('ぁ') remains ぁ.
(I don't know why this differs from the [string] behavior.)
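
For illustration, the same pair can be run through several StringComparison modes side by side. This is a minimal sketch and the variable names are only illustrative. The two ordinal results do not depend on the globalization library. The InvariantCulture results are the ones that can differ between NLS and ICU, so no output is asserted for them here.

```powershell
$large = [string][char]0x3042   # あ
$small = [string][char]0x3041   # ぁ

$results = [ordered]@{}
foreach ($mode in 'Ordinal', 'OrdinalIgnoreCase', 'InvariantCulture', 'InvariantCultureIgnoreCase') {
    # The mode name is converted to the matching StringComparison enum value.
    $results[$mode] = [string]::Equals($large, $small, [System.StringComparison]$mode)
}

# Print one line per comparison mode.
foreach ($mode in $results.Keys) {
    '{0,-28} {1}' -f $mode, $results[$mode]
}
```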

@jhoneill

If you want case sensitive comparison use -ceq instead of -eq (and you can write -ieq to be explicit that you are using case insensitive).
"あ" -eq "ぁ"
Returns true
"あ" -ceq "ぁ"
Returns false.

@AWtnb
Author

AWtnb commented Nov 27, 2023

@mklement0 Thanks for the reply!
I understand that this was caused by the change from NLS to ICU inside .NET.

I found the following in the link.

To revert back to using NLS, a developer can opt out of the ICU implementation.

Is there any way to opt out of ICU and back to NLS from $profile in PowerShell?
I have tried the DOTNET_SYSTEM_GLOBALIZATION_USENLS environment variable as below, but it does not work...

# inside $profile
$env:DOTNET_SYSTEM_GLOBALIZATION_USENLS = 1

I also tried setting System.Globalization.UseNls to true, but could not find a UseNls member in System.Globalization.

Is this something that has to do with .NET and can't be configured from PowerShell?

@mklement0
Contributor

@AWtnb, the DOTNET_SYSTEM_GLOBALIZATION_USENLS environment variable must be defined before PowerShell starts, so doing it inside your $PROFILE file is too late.

You can define a persistent environment variable, but note that this means that all future sessions - whether or not profile loading is disabled with -NoProfile - will then use the NLS APIs.

E.g., for the current user (a one-time action):

[Environment]::SetEnvironmentVariable('DOTNET_SYSTEM_GLOBALIZATION_USENLS', '1', 'User')
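
If a persistent change is undesirable, one possible alternative is to set the variable only for a child process, since child processes inherit the environment of the session that launches them. The following is a sketch under that assumption. It presumes pwsh is on the PATH, and the NLS opt-in only has an effect on Windows.

```powershell
# Remember any existing value so the current session is left unchanged afterwards.
$previous = $env:DOTNET_SYSTEM_GLOBALIZATION_USENLS
try {
    $env:DOTNET_SYSTEM_GLOBALIZATION_USENLS = '1'
    # The child pwsh process starts with the variable already defined.
    # Code points are used instead of literals to avoid quoting and encoding issues.
    pwsh -NoProfile -Command '[string][char]0x3042 -eq [string][char]0x3041'
}
finally {
    # Assigning $null removes the variable again if it was not set before.
    $env:DOTNET_SYSTEM_GLOBALIZATION_USENLS = $previous
}
```

The current session keeps using whichever library it started with; only the child process is affected.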

@AWtnb
Author

AWtnb commented Nov 28, 2023

@mklement0
Understood. I am afraid that changing Windows system environment variables may have a large impact.
I will use the String.Equals method to compare strings containing Japanese characters.

It is too much work to type [System.StringComparison]::Ordinal every time, so I wrote the following in $PROFILE and created my own method.

Update-TypeData -TypeName "System.String" -Force -MemberType ScriptMethod -MemberName ExactlyEquals -Value {
    param([string]$s)
    return [string]::Equals($this, $s, [System.StringComparison]::Ordinal)
}

image

(Prompt is customized in $PROFILE)

@AWtnb
Author

AWtnb commented Nov 28, 2023

Oops, the above custom method would be the same as using -ceq.

For the original purpose:

Update-TypeData -TypeName "System.String" -Force -MemberType ScriptMethod -MemberName CaseInSensitiveEquals -Value {
    param([string]$s)
    return [string]::Equals($this, $s, [System.StringComparison]::OrdinalIgnoreCase)
}

image

@mklement0
Contributor

Yes, I forgot to mention that setting the DOTNET_SYSTEM_GLOBALIZATION_USENLS environment variable persistently wouldn't just affect PowerShell (Core) sessions, but all .NET (Core) applications (except those with manifests explicitly preventing an NLS opt-in).

As an aside: Your first attempt at defining .ExactlyEquals() would not be equivalent to -ceq; the latter uses InvariantCulture, not Ordinal (and "あ" -ceq "ぁ" too yields $false).
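
To illustrate how InvariantCulture and Ordinal comparisons can differ, here is a sketch using a precomposed character and its decomposed equivalent. This example is not from the thread. With default globalization settings, the culture-sensitive comparison is expected to treat the two forms as equal, but that result is not asserted here because it depends on the runtime's globalization mode.

```powershell
$precomposed = [string][char]0x00E5    # å as a single code point
$decomposed  = 'a' + [char]0x030A      # a followed by a combining ring above

# Culture-sensitive comparison: considers canonical equivalence.
$culture = [string]::Equals($precomposed, $decomposed, [System.StringComparison]::InvariantCulture)

# Ordinal comparison: compares code units only, so the two forms differ.
$ordinal = [string]::Equals($precomposed, $decomposed, [System.StringComparison]::Ordinal)

"InvariantCulture: $culture"
"Ordinal:          $ordinal"
```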

@AWtnb
Author

AWtnb commented Nov 28, 2023

I learned a lot! Thank you very much.

@AWtnb AWtnb closed this as completed Nov 28, 2023
@microsoft-github-policy-service microsoft-github-policy-service bot removed the Needs-Triage The issue is new and needs to be triaged by a work group. label Nov 28, 2023