Make console windows fully UTF-8 by default on Windows, in line with the behavior on Unix-like platforms - character encoding, code page #7233

Open
mklement0 opened this issue Jul 5, 2018 · 16 comments
Labels
KeepOpen The bot will ignore these and not auto-close WG-Interactive-Console the console experience WG-NeedsReview Needs a review by the labeled Working Group

Comments

@mklement0
Contributor

mklement0 commented Jul 5, 2018

PowerShell Core now commendably defaults to UTF-8 encoding, including when sending strings to external programs, as reflected in $OutputEncoding's default value.

However, because the console-window shortcut file / taskbar entry still defaults to the OEM code page implied by the legacy system locale (e.g. 437 on US-English systems), it misinterprets strings from external programs; e.g., with Node.js installed:

PSCoreOnWin> $captured = '€' | node -pe "require('fs').readFileSync(0).toString().trim()"; $captured
Γé¼    # !! node's UTF-8 output was misinterpreted.
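
The garbled output is what you get when the three UTF-8 bytes of the euro sign (U+20AC) are decoded as OEM code page 437. A minimal sketch that reproduces this without Node.js follows; the variable names are illustrative only, and it assumes code page 437 is available to .NET in your PowerShell edition:

```powershell
# Encode the euro sign (U+20AC) as UTF-8: bytes 0xE2 0x82 0xAC.
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes([string][char]0x20AC)

# Decode the same bytes the way a console stuck on OEM code page 437 would.
$oem437  = [System.Text.Encoding]::GetEncoding(437)
$misread = $oem437.GetString($utf8Bytes)

# Decoding with UTF-8 instead recovers the original character.
$correct = [System.Text.Encoding]::UTF8.GetString($utf8Bytes)

$misread   # the same three characters as in the garbled output above
$correct   # the euro sign
```

The character is built from its code point so that the snippet itself stays pure ASCII and does not depend on how the script file is encoded.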

This currently requires the following workaround (in addition to requiring the console window to use a TrueType font (true by default on Windows 10)):

[console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding

Prepend $OutputEncoding = to make a Windows PowerShell console fully UTF-8-aware.

The above implicitly switches to the UTF-8 code page (65001), as then reflected in chcp.

This obscure workaround shouldn't be necessary, and I think it would make sense for PowerShell to automatically set [console]::InputEncoding and [console]::OutputEncoding to (BOM-less) UTF-8 on startup.
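
On the "(BOM-less)" part: the parameterless System.Text.UTF8Encoding constructor used in the workaround produces an encoding without a BOM, whereas the static [System.Text.Encoding]::UTF8 instance has a 3-byte preamble. A quick check, as a sketch with illustrative variable names:

```powershell
# Parameterless constructor: no byte-order mark (empty preamble).
$noBom = New-Object System.Text.UTF8Encoding
$noBom.GetPreamble().Length      # 0

# Static UTF8 instance: the preamble is the 3-byte BOM 0xEF 0xBB 0xBF.
$withBom = [System.Text.Encoding]::UTF8
$withBom.GetPreamble().Length    # 3

# Both report code page 65001; only the preamble differs.
$noBom.CodePage, $withBom.CodePage
```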

Update: When this issue was originally created, there was no mechanism for presetting code page 65001 (UTF-8) system-wide, which necessitated the awkward workaround. In recent versions of Windows 10 it is now possible to switch to code page 65001 as the system locale and therefore system-wide, although as of Windows 10 version 1909 that feature is still in beta - see this SO answer.

  • Caveat: In addition to defaulting the OEM code page to 65001 in all console windows (including cmd.exe windows), this invariably also makes Windows PowerShell's ANSI-encoding-default cmdlets default to UTF-8, notably Get-Content and Set-Content, which can be problematic from a backward-compatibility perspective.
    Additionally, there is a bug - see below.

The change, which can also be made programmatically (see below), requires administrative privileges and a reboot.

Environment data

PowerShell Core 7.1.0-preview.3 on Windows 10
@iSazonov
Collaborator

It is a platform default:
https://source.dot.net/#System.Console/System/Console.cs,a570cd79bd33ceab
https://source.dot.net/#System.Console/System/ConsolePal.Windows.cs,c997db0e94f0d1cc
https://source.dot.net/#System.Console/Common/Interop/Windows/Interop.GetConsoleOutputCP.cs,f028312cfc964730

So we need to run [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding at PowerShell Core startup. @mklement0, is this the right fix for all platforms and Windows versions (Windows 7?)?

@mklement0
Contributor Author

Thanks for the sleuthing, @iSazonov.

Yes, I think the fix is also appropriate for Windows 7:

While you're more likely to run into problems with standard console programs there that can even break with UTF-8 input, I think it's more important for PowerShell Core to exhibit consistent encoding behavior and to support modern, cross-platform utilities that natively speak UTF-8 by default.

@mklement0
Contributor Author

mklement0 commented Aug 28, 2018

@iSazonov: Forgot to clarify: It is only the right fix for Windows - on Unix-like platforms the CoreFx default should be used, as discussed in #7634 (even though there's a CoreFx fix pending).

@iSazonov
Collaborator

I hope @JamesWTruher can comment. I think he considered this when writing and implementing the Encoding RFC.

@iSazonov
Collaborator

iSazonov commented Mar 12, 2020

Since Windows 7 has reached EOL and the community is migrating to Windows 10, it seems time to switch the console default to UTF-8 on Windows.

/cc @SteveL-MSFT

@KalleOlaviNiemitalo

@nu8, are you using Windows PowerShell? In PowerShell Core, the default encoding for Get-Content on files has been UTF8NoBOM since #5080.

@mklement0

This comment has been minimized.

@mklement0
Contributor Author

mklement0 commented Jun 7, 2020

Let me try to summarize, now that we (hopefully) have the full picture:

I've hidden my previous comments in favor of this one, @nu8 - I encourage you to do the same, as appropriate. This comment also corrects my incorrect earlier claim that you cannot set the ANSI code page to 65001.

This issue is about making UTF-8 support in PowerShell on Windows complete, by making sure that PowerShell also uses UTF-8 when communicating with external programs (the built-in cmdlets already default to UTF-8, invariably so), which requires setting [console]::InputEncoding and [console]::OutputEncoding to (BOM-less) UTF-8 (possibly indirectly).


Currently, in the absence of PowerShell doing that itself, there are two workarounds:

Option 1: Put the following statement in your $PROFILE:

# In *Windows PowerShell*, prepend `$OutputEncoding = `
[console]::InputEncoding = [console]::OutputEncoding = [System.Text.UTF8Encoding]::new()

Pros and cons:

  • Doesn't require administrative privileges and takes effect in new windows without the need for a reboot.
  • Requires modifying $PROFILE
  • Is bypassed if the CLI is used as pwsh -noprofile ...

Note: In Windows PowerShell, you must prepend $OutputEncoding = to the above command, in order to also make Windows PowerShell send UTF-8 to external programs. (In PowerShell [Core], this preference variable commendably defaults to (BOM-less) UTF-8.)


Option 2: Change the active code pages to 65001 system-wide (W10+):

  • GUI method: via intl.cpl (Control Panel), tab Administrative, Change system locale...; as previously noted, this is still labeled as beta as of Windows 10 release 1909, though I suspect it will work fine as long as you use only modern command-line utilities.

  • Equivalent programmatic method, based on @nu8's approach:

# Requires ELEVATION and a REBOOT
'ACP', 'OEMCP', 'MACCP' | Set-ItemProperty HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage -Name { $_ } 65001
# Restart-Computer
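
Before changing these values, it may be worth recording the current ones so that the change can be reverted. A read-only sketch follows; Get-SystemCodePage is a hypothetical helper name, and reading the key should not require elevation:

```powershell
# Read-only: show the current system code-page settings (Windows only).
# 'Get-SystemCodePage' is a hypothetical helper name used for illustration.
function Get-SystemCodePage {
    $key = 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage'
    # -or short-circuits, so the registry is not touched on other platforms.
    if ($env:OS -ne 'Windows_NT' -or -not (Test-Path -LiteralPath $key)) {
        return $null   # not on Windows, or key not found
    }
    Get-ItemProperty -LiteralPath $key -Name ACP, OEMCP, MACCP |
        Select-Object ACP, OEMCP, MACCP
}

Get-SystemCodePage
```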

Pros and cons:

  • Due to a .NET bug still present in the .NET version underlying PowerShell Core 7.1.0-preview.3, [console]::InputEncoding and [console]::OutputEncoding are mistakenly set to UTF-8 encoding with BOM, which causes follow-on bugs; notably, it breaks Start-Job in PowerShell. See System.Console unexpectedly uses a UTF-8 encoding *with BOM* on Windows dotnet/runtime#28929. Option 1 above doesn't have this problem.

    • Curiously, by contrast, the [System.Text.Encoding]::Default encoding that reflects the active ANSI code page contains a BOM-less UTF-8 encoding after the system-wide change (see below).

    • Note that the bug can also manifest without the system-wide change, namely if you manually run chcp 65001 from cmd.exe, for instance, before invoking PowerShell (running chcp from inside PowerShell isn't supported and requires Option 1 instead).

  • Requires administrative privileges and a reboot.

  • Takes effect system-wide: it applies to all console / Windows Terminal windows, notably including those running cmd.exe

  • Invariably also uses UTF-8 as the ANSI code page (not just the OEM code page), as reflected in [System.Text.Encoding]::Default (note that this also applies if you set only the OEMCP registry value to 65001; (Get-Culture).TextInfo.ANSICodePage, by contrast, continues to report the locale-appropriate code page, e.g. 1252).

    • If you're (also) running Windows PowerShell, this means that the setting invariably makes Windows PowerShell's ANSI-encoding-default cmdlets default to UTF-8, notably Get-Content and Set-Content, which, depending on your backward-compatibility needs:
      • may be desirable for consistent UTF-8 use across both PowerShell editions.
        • Note: short of placing $PSDefaultParameterValues['*:Encoding'] = 'utf8' in your $PROFILE, this is the only way to get Windows PowerShell to consistently default to UTF-8.
          Curiously, the system-wide change causes Windows PowerShell to then create BOM-less UTF-8 files by default with Set-Content, something that cannot otherwise be achieved, except with direct use of .NET.
      • may be undesired, if you have existing code that uses Get-Content and Set-Content without -Encoding and you need to process BOM-less files that are ANSI- rather than UTF-8-encoded.
    • Also, in Windows PowerShell only, you must additionally still run
      $OutputEncoding = [System.Text.Utf8Encoding]::new() (via $PROFILE) in order to also make Windows PowerShell send UTF-8 to external programs.

A note on file encoding:

If making Windows PowerShell default to UTF-8 as well via the system-wide change is not an option, BOM-less UTF-8 files will only be read correctly under one of the following conditions:

  • you use -Encoding Utf8 with file-handling cmdlets.

  • you convert your BOM-less UTF-8 files to have a BOM

    • such files can be problematic in cross-platform use; on Unix-like platforms, a UTF-8 BOM can be misinterpreted as data
    • conversely, if you write PowerShell code that contains (runtime-relevant) non-ASCII characters and needs to run in both editions, saving your source code files as UTF-8 with BOM is a must (though you could also use UTF-16).
  • you preset the default encoding via $PSDefaultParameterValues['*:Encoding'] = 'utf8', but you'll have to scope this setting if you don't want all code to use these defaults.
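
One way to scope the preset, as a sketch: assign a local $PSDefaultParameterValues inside a script block or function, so that it shadows the session-wide variable only there. The file name below is hypothetical, and the sketch assumes the parameter binder honors the locally scoped variable:

```powershell
# Hypothetical sample file in the temp directory, for illustration only.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'utf8-scope-demo.txt'
$text = 'caf' + [char]0xE9        # 'cafe' with e-acute, built without a non-ASCII literal

& {
    # A local variable: in effect only inside this script block (child scope).
    $PSDefaultParameterValues = @{ '*:Encoding' = 'utf8' }
    Set-Content -LiteralPath $path -Value $text
    Get-Content -LiteralPath $path
}

# Outside the block, the session-wide defaults are untouched by the above.
$PSDefaultParameterValues.Contains('*:Encoding')
```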

Note that Windows PowerShell - curiously, except if the system-wide change is made - only ever creates UTF-8 files with BOM (whereas PowerShell [Core] defaults to BOM-less UTF-8 and has an -Encoding utf8BOM opt-in); direct use of .NET is required to work around that - see this SO answer.
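
A minimal sketch of the direct .NET approach, which works in both editions (the file name is hypothetical):

```powershell
# Hypothetical output path; .NET methods need a full path, because .NET's
# working directory usually differs from PowerShell's current location.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'utf8-nobom-demo.txt'
$text = 'caf' + [char]0xE9        # 'cafe' with e-acute

# $false = do not emit a BOM.
$utf8NoBom = New-Object System.Text.UTF8Encoding $false
[System.IO.File]::WriteAllText($path, $text, $utf8NoBom)

# Inspect the bytes: a BOM would show up as a leading EF BB BF.
$bytes = [System.IO.File]::ReadAllBytes($path)
($bytes | ForEach-Object { $_.ToString('X2') }) -join ' '   # 63 61 66 C3 A9
```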

@mklement0

This comment was marked as outdated.

@gerardog

gerardog commented Mar 29, 2022

Is it a good idea to implement this inside PowerShell?

Changing the encoding inside pwsh also changes the code page, which also affects:

  • The language of Windows console applications like CMD.EXE
  • The default encoding assumed when opening a file without BOM.

Those changes remain after pwsh ends, until the console is closed.
In other words: if running PWSH changes the encoding, it will impact the console session permanently:

image

So (unless there is a way to decouple the encoding and the code page... Why are they coupled in the first place?), should the current code page be changed by a console app? I don't think so.

IMO, it either should be a system-wide setting, or a setting in WindowsTerminal / ConHost. Not a responsibility of a console app or a shell...

@mklement0
Contributor Author

mklement0 commented Oct 18, 2023

Good point, @gerardog:

If PowerShell is called from another shell, or more generally, from an existing console window, it wouldn't be appropriate to change the console window's code page without also restoring it on exit.

either should be a system-wide setting

The system-wide change to UTF-8, as discussed in detail above, doesn't require any changes, and is already an option - but it has far-reaching consequences that may not work for everyone. Notably, both the OEM and the ANSI code page are then set to 65001 (UTF-8), which would break Windows PowerShell source code that uses BOM-less ANSI-encoded files containing non-ASCII characters.

a setting in WindowsTerminal / ConHost.

That is an option - but a very cumbersome one: for ConHost you'd have to do it on a per-window-title basis via the registry, and individual shortcut files and Windows Terminal profiles would have to be modified with startup commands.

The point is that PowerShell internally defaults to UTF-8, and externally it already defaults to UTF-8 when sending (piping) data, but not when receiving it, which makes for an awkward asymmetry.

In order to make external programs use UTF-8 too, it must set the console code page(s) - the latter are what well-behaved CLIs consult in order to decide what character encoding to use.


A simple solution - both conceptually simple and easy to document - would be to make PowerShell switch to UTF-8 (including changing the console code page) if and only if one of the following applies:

  • it owns the console window at hand (implying that it has no console-application parent process, such as a cmd.exe session / batch file)

  • an interactive session is being entered (even from another shell / console application), in which case it should restore the original code page on exiting.

Conversely, that means that non-interactive CLI calls (via -Command (-c) or -File (-f)) from existing console windows would continue to honor the current console window's code page(s).
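
The switch-and-restore behavior can be approximated at the script level today. The following is only a sketch of the pattern, not of how PowerShell itself would implement it; Invoke-WithUtf8Console is a hypothetical helper name, and the sketch assumes a console is attached:

```powershell
# 'Invoke-WithUtf8Console' is a hypothetical helper, not an existing command.
function Invoke-WithUtf8Console {
    param([Parameter(Mandatory)] [scriptblock] $ScriptBlock)

    # Remember the encodings (and thereby the code pages) in effect on entry.
    $previousIn  = [console]::InputEncoding
    $previousOut = [console]::OutputEncoding
    try {
        # Switch to BOM-less UTF-8 for the duration of the call.
        $utf8 = New-Object System.Text.UTF8Encoding
        [console]::InputEncoding = [console]::OutputEncoding = $utf8
        & $ScriptBlock
    }
    finally {
        # Restore the caller's settings, even if the script block failed.
        [console]::InputEncoding  = $previousIn
        [console]::OutputEncoding = $previousOut
    }
}

# Usage: report the output code page seen inside the call.
Invoke-WithUtf8Console { [console]::OutputEncoding.CodePage }
```

Restoring in a finally block addresses the concern raised above that a changed code page otherwise outlives the PowerShell session in the calling console.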

@Demonese

[console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding

Oh MY GOD, thanks for your solution! I have been troubled by this issue for a long time, even though I have switched the code page to UTF-8 (65001). I also tried $PSDefaultParameterValues['*:Encoding'] = 'utf8', but it didn't work.

Java sources:

System.out.println("""
        Hello world!
        你好,世界!
        こんにちは世界!
        안녕 세상!
        """.stripTrailing());
System.out.println("System.out.charset(): " + System.out.charset());
System.out.println("properties:");
System.getProperties().forEach((k, v) -> {
    if (k instanceof String ks && ks.contains("encod")) {
        System.out.printf("  %s = %s%n", k, v);
    }
});

Before:

PS D:\Project\java-encoding\target> chcp 65001
Active code page: 65001
PS D:\Project\java-encoding\target> java -jar java-encoding-1.0-SNAPSHOT-jar-with-dependencies.jar
Hello world!
浣犲ソ锛屼笘鐣岋紒
銇撱倱銇仭銇笘鐣岋紒
鞎堧厱 靹胳儊!
System.out.charset(): UTF-8
properties:
  sun.jnu.encoding = GBK
  stdout.encoding = UTF-8
  file.encoding = UTF-8
  native.encoding = GBK
  stderr.encoding = UTF-8
  sun.io.unicode.encoding = UnicodeLittle

After:

PS D:\Project\java-encoding\target> chcp 65001
Active code page: 65001
PS D:\Project\java-encoding\target> [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding
PS D:\Project\java-encoding\target> java -jar java-encoding-1.0-SNAPSHOT-jar-with-dependencies.jar                            
Hello world!
你好,世界!
こんにちは世界!
안녕 세상!
System.out.charset(): UTF-8
properties:
  sun.jnu.encoding = GBK
  stdout.encoding = UTF-8
  file.encoding = UTF-8
  native.encoding = GBK
  stderr.encoding = UTF-8
  sun.io.unicode.encoding = UnicodeLittle
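
The garbled "Before" lines look like UTF-8 bytes that were decoded as a double-byte code page. A sketch that reproduces the pattern, assuming the code page involved was 936 (GBK), as the sun.jnu.encoding and native.encoding values suggest; the variable names are illustrative:

```powershell
# Build the first two characters of the sample's second line from code
# points (U+4F60, U+597D), so the snippet itself contains only ASCII.
$original = [string][char]0x4F60 + [char]0x597D

# What the JVM wrote: UTF-8 bytes (stdout.encoding = UTF-8).
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($original)

# Decode those bytes as code page 936 instead.
$cp936   = [System.Text.Encoding]::GetEncoding(936)
$misread = $cp936.GetString($utf8Bytes)

$misread            # compare with the start of the garbled line in "Before"
$misread.Length     # 6 UTF-8 bytes read as 3 double-byte characters
```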

I strongly recommend providing the official UTF-8 configuration guide on the Windows platform. Otherwise, many developers will not be able to easily obtain the correct answer through Google/Bing/ChatGPT...

@mklement0 mklement0 changed the title Make console windows fully UTF-8 by default on Windows, in line with the behavior on Unix-like platforms Make console windows fully UTF-8 by default on Windows, in line with the behavior on Unix-like platforms - character encoding, code page Mar 26, 2024
@SteveL-MSFT SteveL-MSFT added KeepOpen The bot will ignore these and not auto-close WG-NeedsReview Needs a review by the labeled Working Group labels Apr 29, 2024
7 participants
@mklement0 @gerardog @SteveL-MSFT @iSazonov @KalleOlaviNiemitalo @Demonese and others