
Prepare for BOM-less UTF-8 default character encoding with respect to $OutputEncoding and console code page #4681

Closed · mklement0 opened this issue Aug 27, 2017 · 16 comments · 7 participants

mklement0 (Contributor) commented Aug 27, 2017

BOM-less UTF-8 character encoding is coming as the default for PowerShell Core on all platforms.

Two attendant changes are required:

  • Preference variable $OutputEncoding, which currently defaults to ASCII, must default to [System.Text.UTF8Encoding]::new() (UTF-8 without a BOM); alternatively, and perhaps preferably, PowerShell should not predefine the variable at all and fall back to that encoding (the internal default) in its absence.

    • $OutputEncoding tells PowerShell what character encoding to use when sending output to external utilities.
  • Console / terminal character encoding:

    • On Windows, [Console]::InputEncoding and [Console]::OutputEncoding must both be set to [System.Text.UTF8Encoding]::new(), which is the equivalent of configuring a console window to use code page 65001 (UTF-8) or executing chcp 65001 before PowerShell is launched.

      • [Console]::OutputEncoding tells PowerShell what encoding to assume when reading output from external utilities.

      • On Windows, the Start Menu shortcut that is created during installation should be preconfigured to open a console window with code page 65001.

        • Conceivably, PowerShell should automatically switch to code page 65001 when it is launched from a console window with a different active code page (such as from cmd.exe). Note, however, that such a change remains in effect until the window is closed, even after exiting PowerShell and returning to cmd.exe; a warning on startup could mitigate the surprise.
    • On Unix platforms with UTF-8-based locales, which are the norm these days, no action is required.

      • To be determined: How should the rare event of being invoked from a terminal with a different active character encoding be handled? Changing the encoding on the fly, as on Windows, is not guaranteed to work. Perhaps a warning on startup is sufficient.

Until the above is implemented, the interim workaround to make a console window / terminal use UTF-8 consistently is the following command:

$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
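
To verify that the workaround took effect, a quick sanity check can look like this (a sketch; cat is assumed to be available, as on Unix or via Git for Windows):

# All three should now report UTF-8:
$OutputEncoding; [Console]::InputEncoding; [Console]::OutputEncoding
# Non-ASCII text should round-trip intact through an external utility:
'Voilà: café €' | cat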

Environment data

PowerShell Core v6.0.0-beta.6

mklement0 changed the title from "Prepare for no-BOM UTF-8 default encoding with respect to $OutputEncoding and console code page" to "Prepare for BOM-less UTF-8 default encoding with respect to $OutputEncoding and console code page" Aug 27, 2017

mklement0 changed the title from "Prepare for BOM-less UTF-8 default encoding with respect to $OutputEncoding and console code page" to "Prepare for BOM-less UTF-8 default character encoding with respect to $OutputEncoding and console code page" Aug 27, 2017

@iSazonov (Collaborator) commented Aug 28, 2017

It is already addressed in #4119.

@mklement0 (Contributor) commented Aug 28, 2017

Thanks, @iSazonov, good to see how far the work has progressed.

However, as far as I can tell, #4119 addresses neither of the two issues raised here (which is why I called them attendant changes).

@iSazonov (Collaborator) commented Aug 29, 2017

#4119 was closed without being merged; we are waiting for a new PR.

SteveL-MSFT added this to the 6.0.0-HighPriority milestone Sep 24, 2017

@muzzar78 commented Oct 2, 2017

I noticed a difference between PS 5.1 and PS Core 6 Beta 7 when using Get-Content with the -Raw switch on a file that ends with a LF. In 5.1 the LF was read as part of the string and was the last character in my variable, whereas in 6 Beta 7 the LF was stripped. Is this by design? If I used "-encoding binary", then both 5.1 and 6 read the LF. This may cause some issues.
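
A minimal repro of the behavior described above (a sketch; the file name is illustrative):

# Create a file whose content ends in exactly one LF:
"line1`n" | Set-Content -NoNewline lf-test.txt
# PS 5.1: True (trailing LF preserved); PS Core 6 Beta 7, per the report: False
(Get-Content -Raw lf-test.txt).EndsWith("`n")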

@SteveL-MSFT (Member) commented Oct 2, 2017

-Raw should be raw, including the EOL. Seems like a bug to me.

@mklement0 (Contributor) commented Oct 3, 2017

@SteveL-MSFT: Indeed; please see #4980.

@JamesWTruher (Member) commented Oct 31, 2017

I'm not sure that setting $OutputEncoding to UTF-8 without a BOM is correct, at least on some platforms. Here's an example from my MacBook; I have a compressed tar archive, which I would love to unspool as:

gc a.tgz | gunzip | tar xvf -

Setting $OutputEncoding to UTF-8 without a BOM doesn't do the trick:

PS> $outputEncoding = $utf8
PS> gc -raw f.tgz | gunzip | tar tfv -
gunzip: unknown compression format

It turns out there are a couple of problems: first, when we read the content, we break the output into lines. I can fix that with -Raw, but it still doesn't work. However, if I read the file and set $OutputEncoding with the encoding set to iso-8859-1, voilà, it works (sort of)!

PS> $outputEncoding = $enc
PS> gc -raw -encoding $enc f.tgz | gunzip | tar tvf -
gunzip: (stdin): trailing garbage ignored
-rw-r--r--  0 james  wheel       2 Oct 31 12:39 1.txt
-rw-r--r--  0 james  wheel       2 Oct 31 12:36 2.txt
-rw-r--r--  0 james  wheel       2 Oct 31 12:36 3.txt
...

The last problem is that we seem to tack [environment]::NewLine onto whatever we push down the pipe (which is what causes gunzip to complain about the trailing garbage).
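
For reference, the example above does not show how $enc was defined; from the surrounding text it is presumably:

$enc = [System.Text.Encoding]::GetEncoding('iso-8859-1')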

@mklement0 (Contributor) commented Nov 1, 2017

@JamesWTruher:

$OutputEncoding should only ever apply when sending text to external utilities.

The real problem here - a separate issue - is that PowerShell knows ONLY text; it lacks support for passing binary data through the pipeline.

Get-Content is designed for reading text files, and -Raw makes matters worse in this case by reading the entire file at once.
(I always thought -Raw was an unfortunate name for what this switch does.)

While you might think that something like
Get-Content -AsByteStream f.tgz | gunzip | tar tfv -
would work, in reality PowerShell sends the textual, decimal representations of the byte values through the pipeline.

Contrast that with the Unix utility cat, which simply copies raw bytes from stdin to stdout; it has no concept of encoding.

A "binary pipeline" has apparently been partially implemented - namely, for direct external-utility-to-external-utility piping - but AFAIK there is currently still no way to send binary data from a PowerShell command through the pipeline in raw binary form.

@iSazonov (Collaborator) commented Nov 1, 2017

Can we detect -AsByteStream and then treat "cmdlet | utility" as a binary pipeline?

@JamesWTruher (Member) commented Nov 1, 2017

@mklement0
A byte stream can be considered a stream of characters. It is the .NET coercion into an encoded string that is the issue, which is why my selection of iso-8859-1 works: it does not change the representation of the bytes (unlike Unicode, which pads with a 0 byte, or UTF-7, which can use up to 5(!) bytes per character).

At the base of all of this is that the method used to read the data from the process (in CoreFX) renders it to a string, which means we need to find an encoding that does not alter the individual characters (as UTF-8, Unicode, etc. do) but passes them through unmolested, as iso-8859-1 does.

Selecting iso-8859-1 as $OutputEncoding does what you are looking for, which is why my example works as described. I can get the contents of a file in PowerShell and pipe it to native executables without issue: a binary pipeline.
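
Why iso-8859-1 is byte-transparent, in a short sketch: every byte value 0-255 maps to exactly one character and back, so decoding and re-encoding reproduces the original bytes:

$enc = [System.Text.Encoding]::GetEncoding('iso-8859-1')
$bytes = [byte[]](0..255)
$roundTrip = $enc.GetBytes($enc.GetString($bytes))
# No output means the two byte sequences are identical:
Compare-Object $bytes $roundTrip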

@mklement0 (Contributor) commented Nov 2, 2017

@iSazonov: I like the idea.

@mklement0 (Contributor) commented Nov 2, 2017

@JamesWTruher

"if iso-8859-1 works - it does not change the representation of bytes"

Yes, it does change the representation; you just have to find the right characters:

> $outputEncoding = [System.Text.Encoding]::GetEncoding('iso-8859-1')
> '€' | cat
?
> '€' | grep '€'  # no output

€ has no representation in iso-8859-1 (it is outside the 8-bit range in Unicode), so it gets transliterated to a literal ?.

Now let's try with UTF-8:

> $outputEncoding = [System.Text.UTF8Encoding]::new()
> '€' | cat
€
> '€' | grep '€'
€

Voilà: the UTF-16LE representation of € was correctly translated into its UTF-8 representation.

That's why, when it comes to text, UTF-8 is the right default value for $OutputEncoding.

When binary output is desired, by contrast, there is no reason to bring character encodings into the picture at all.

Get-Content -AsByteStream is the closest thing we have to bypassing as-string processing, but, as stated, you cannot send this byte stream through the pipeline as-is.

@iSazonov's suggestion is promising, but I wonder if it goes far enough; there may be other cases where passing raw bytes through the pipeline from PowerShell is needed.
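
The byte-level view underlying the example above (a sketch): € is U+20AC, which lies outside the 8-bit range, so the two encoders treat it very differently:

# UTF-8 encodes U+20AC as the three bytes 0xE2 0x82 0xAC:
[System.Text.Encoding]::UTF8.GetBytes('€')                        # 226 130 172
# iso-8859-1 cannot represent it and falls back to '?' (0x3F):
[System.Text.Encoding]::GetEncoding('iso-8859-1').GetBytes('€')   # 63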

@stknohg (Contributor) commented Nov 7, 2017

Hi.
We also need to think about non-English languages, for example CJK:

# PowerShell 6.0 Beta.9 on CentOS 7.4
PS /> $outputEncoding = [System.Text.Encoding]::GetEncoding('iso-8859-1')
PS /> 'こんにちは世界' | cat
???????
PS /> $outputEncoding = [System.Text.UTF8Encoding]::new()
PS /> 'こんにちは世界' | cat
こんにちは世界

I think $OutputEncoding should be the same as [Console]::OutputEncoding.
From the CoreFX sources, [Console]::OutputEncoding is UTF-8 on Unix.
On Windows, [Console]::OutputEncoding depends on the GetConsoleOutputCP() function.
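
A sketch of the alignment suggested above, as a single assignment at startup (on Unix this yields UTF-8; on Windows, whatever code page GetConsoleOutputCP() reports):

# Keep what we send to external utilities in sync with what the console
# expects to receive and display:
$OutputEncoding = [Console]::OutputEncoding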

@JamesWTruher (Member) commented Nov 7, 2017

With $OutputEncoding set in this way, the following scenario will not work:

get-content archive.tgz | gunzip | tar xvf -

@SteveL-MSFT (Member) commented Nov 7, 2017

@JamesWTruher understood; I think it's OK for 6.0.0, since that never worked correctly.

@daxian-dbw (Member) commented Nov 7, 2017

Closed via #5369.

daxian-dbw closed this Nov 7, 2017
