
Prepare for BOM-less UTF-8 default character encoding with respect to $OutputEncoding and console code page #4681

Closed · mklement0 opened this issue Aug 27, 2017 · 16 comments · 7 participants

mklement0 (Contributor) commented Aug 27, 2017

BOM-less UTF-8 character encoding is coming as the default for PowerShell Core on all platforms.

Two attendant changes are required:

  • Preference variable $OutputEncoding, which currently defaults to ASCII, must default to [System.Text.UTF8Encoding]::new() (UTF-8 without a BOM); alternatively, and perhaps preferably, PowerShell should not predefine the variable at all and fall back to that encoding (the internal default) in its absence.

    • $OutputEncoding tells PowerShell what character encoding to use when sending output to external utilities.
  • Console / terminal character encoding:

    • On Windows, [Console]::InputEncoding and [Console]::OutputEncoding must both be set to [System.Text.UTF8Encoding]::new(), which is the equivalent of configuring a console window to use code page 65001 (UTF-8) or executing chcp 65001 before PowerShell is launched.

      • [Console]::OutputEncoding tells PowerShell what encoding to assume when reading output from external utilities.

      • On Windows, the Start Menu shortcut that is created during installation should be preconfigured to open a console window with code page 65001.

        • Conceivably, PowerShell should automatically switch to code page 65001 when it is launched from a console window with a different active code page (such as from cmd.exe). Note, however, that such a change remains in effect until the window is closed, even after exiting PowerShell and returning to cmd.exe; a warning on startup could mitigate the surprise.
    • On Unix platforms with UTF-8-based locales, which are the norm these days, no action is required.

      • To be determined: How should the rare event of being invoked from a terminal with a different active character encoding be handled? Changing the encoding on the fly, as on Windows, is not guaranteed to work. Perhaps a warning on startup is sufficient.

Until the above is implemented, the interim workaround to make a console window / terminal use UTF-8 consistently is the following command:

$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
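
To verify that the workaround took effect, a quick sanity check can look like this (a sketch; cat is assumed to be available, as on Unix or via Git for Windows):

# All three should now report UTF-8:
$OutputEncoding; [Console]::InputEncoding; [Console]::OutputEncoding
# Non-ASCII text should round-trip intact through an external utility:
'Voilà: café €' | cat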

Environment data

PowerShell Core v6.0.0-beta.6

mklement0 changed the title from "Prepare for no-BOM UTF-8 default encoding with respect to $OutputEncoding and console code page" to "Prepare for BOM-less UTF-8 default encoding with respect to $OutputEncoding and console code page" Aug 27, 2017

mklement0 changed the title from "Prepare for BOM-less UTF-8 default encoding with respect to $OutputEncoding and console code page" to "Prepare for BOM-less UTF-8 default character encoding with respect to $OutputEncoding and console code page" Aug 27, 2017

@iSazonov (Collaborator) commented Aug 28, 2017

It is already addressed in #4119.

@mklement0 (Contributor) commented Aug 28, 2017

Thanks, @iSazonov, good to see how far the work has progressed.

However, as far as I can tell, #4119 addresses neither of the two issues raised here (which is why I called them attendant changes).

@iSazonov (Collaborator) commented Aug 29, 2017

#4119 was closed without being merged; we are waiting for a new PR.

SteveL-MSFT added this to the 6.0.0-HighPriority milestone Sep 24, 2017

@muzzar78 commented Oct 2, 2017

I noticed a difference between PS 5.1 and PS Core 6 Beta 7 when using Get-Content with the -Raw switch on a file that ends with a LF. In 5.1 the LF was read as part of the string and was the last character in my variable, whereas in 6 Beta 7 the LF was stripped. Is this by design? If I used "-encoding binary", then both 5.1 and 6 read the LF. This may cause some issues.
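
A minimal repro of the behavior described above (a sketch; the file name is illustrative):

# Create a file whose content ends in exactly one LF:
"line1`n" | Set-Content -NoNewline lf-test.txt
# PS 5.1: True (trailing LF preserved); PS Core 6 Beta 7, per the report: False
(Get-Content -Raw lf-test.txt).EndsWith("`n")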

@SteveL-MSFT (Member) commented Oct 2, 2017

-Raw should be raw, including the EOL. Seems like a bug to me.

@mklement0 (Contributor) commented Oct 3, 2017

@SteveL-MSFT: Indeed; please see #4980.

@JamesWTruher (Member) commented Oct 31, 2017

I'm not sure that setting $OutputEncoding to UTF-8 without a BOM is correct, at least on some platforms. Here's an example from my MacBook; I have a compressed tar archive, which I would love to unspool as:

gc a.tgz | gunzip | tar xvf -

Setting $OutputEncoding to UTF-8 without a BOM doesn't do the trick:

PS> $outputEncoding = $utf8
PS> gc -raw f.tgz | gunzip | tar tfv -
gunzip: unknown compression format

It turns out there are a couple of problems: first, when we read the content, we break the output into lines. I can fix that with -Raw, but it still doesn't work. However, if I read the file and set $OutputEncoding with the encoding set to iso-8859-1, voilà, it works (sort of)!

PS> $outputEncoding = $enc
PS> gc -raw -encoding $enc f.tgz | gunzip | tar tvf -
gunzip: (stdin): trailing garbage ignored
-rw-r--r--  0 james  wheel       2 Oct 31 12:39 1.txt
-rw-r--r--  0 james  wheel       2 Oct 31 12:36 2.txt
-rw-r--r--  0 james  wheel       2 Oct 31 12:36 3.txt
...

The last problem is that we seem to tack [environment]::NewLine onto whatever we push down the pipe (which is what causes gunzip to complain about the trailing garbage).
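
For reference, the example above does not show how $enc was defined; from the surrounding text it is presumably:

$enc = [System.Text.Encoding]::GetEncoding('iso-8859-1')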

@mklement0 (Contributor) commented Nov 1, 2017

@JamesWTruher:

$OutputEncoding should only ever apply when sending text to external utilities.

The real problem here - a separate issue - is that PowerShell knows ONLY text; it lacks support for passing binary data through the pipeline.

Get-Content is designed for reading text files, and -Raw makes matters worse in this case by reading the entire file at once.
(I always thought -Raw was an unfortunate name for what this switch does.)

While you might think that something like
Get-Content -AsByteStream f.tgz | gunzip | tar tfv -
would work, in reality PowerShell sends the textual, decimal representations of the byte values through the pipeline.

Contrast that with the Unix utility cat, which simply copies raw bytes from stdin to stdout; it has no concept of encoding.

A "binary pipeline" has apparently been partially implemented - namely, for direct external-utility-to-external-utility piping - but AFAIK there is currently still no way to send binary data from a PowerShell command through the pipeline in raw binary form.

@iSazonov (Collaborator) commented Nov 1, 2017

Can we detect -AsByteStream and then treat "cmdlet | utility" as a binary pipeline?

@JamesWTruher (Member) commented Nov 1, 2017

@mklement0
A byte stream can be considered a stream of characters. It is the .NET coercion into an encoded string that is the issue, which is why my selection of iso-8859-1 works: it does not change the representation of the bytes (unlike Unicode, which pads with a 0 byte, or UTF-7, which can use up to 5(!) bytes per character).

At the base of all of this is that the method used to read the data from the process (in CoreFX) renders it to a string, which means we need to find an encoding that does not alter the individual characters (as UTF-8, Unicode, etc. do) but passes them through unmolested, as iso-8859-1 does.

Selecting iso-8859-1 as $OutputEncoding does what you are looking for, which is why my example works as described. I can get the contents of a file in PowerShell and pipe it to native executables without issue: a binary pipeline.
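
Why iso-8859-1 is byte-transparent, in a short sketch: every byte value 0-255 maps to exactly one character and back, so decoding and re-encoding reproduces the original bytes:

$enc = [System.Text.Encoding]::GetEncoding('iso-8859-1')
$bytes = [byte[]](0..255)
$roundTrip = $enc.GetBytes($enc.GetString($bytes))
# No output means the two byte sequences are identical:
Compare-Object $bytes $roundTrip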

@mklement0 (Contributor) commented Nov 2, 2017

@iSazonov: I like the idea.

@mklement0 (Contributor) commented Nov 2, 2017

@JamesWTruher

"if iso-8859-1 works - it does not change the representation of bytes"

Yes, it does change the representation; you just have to find the right characters:

> $outputEncoding = [System.Text.Encoding]::GetEncoding('iso-8859-1')
> '€' | cat
?
> '€' | grep '€'  # no output

€ has no representation in iso-8859-1 (it is outside the 8-bit range in Unicode), so it gets transliterated to a literal ?.

Now let's try with UTF-8:

> $outputEncoding = [System.Text.UTF8Encoding]::new()
> '€' | cat
€
> '€' | grep '€'
€

Voilà: the UTF-16LE representation of € was correctly translated into its UTF-8 representation.

That's why, when it comes to text, UTF-8 is the right default value for $OutputEncoding.

When binary output is desired, by contrast, there is no reason to bring character encodings into the picture at all.

Get-Content -AsByteStream is the closest thing we have to bypassing as-string processing, but, as stated, you cannot send this byte stream through the pipeline as-is.

@iSazonov's suggestion is promising, but I wonder if it goes far enough; there may be other cases where passing raw bytes through the pipeline from PowerShell is needed.
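
The byte-level view underlying the example above (a sketch): € is U+20AC, which lies outside the 8-bit range, so the two encoders treat it very differently:

# UTF-8 encodes U+20AC as the three bytes 0xE2 0x82 0xAC:
[System.Text.Encoding]::UTF8.GetBytes('€')                        # 226 130 172
# iso-8859-1 cannot represent it and falls back to '?' (0x3F):
[System.Text.Encoding]::GetEncoding('iso-8859-1').GetBytes('€')   # 63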

@stknohg (Contributor) commented Nov 7, 2017

Hi.
We also need to think about non-English languages, for example CJK:

# PowerShell 6.0 Beta.9 on CentOS 7.4
PS /> $outputEncoding = [System.Text.Encoding]::GetEncoding('iso-8859-1')
PS /> 'こんにちは世界' | cat
???????
PS /> $outputEncoding = [System.Text.UTF8Encoding]::new()
PS /> 'こんにちは世界' | cat
こんにちは世界

I think $OutputEncoding should be the same as [Console]::OutputEncoding.
From the CoreFX sources, [Console]::OutputEncoding is UTF-8 on Unix.
On Windows, [Console]::OutputEncoding depends on the GetConsoleOutputCP() function.
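
A sketch of the alignment suggested above, as a single assignment at startup (on Unix this yields UTF-8; on Windows, whatever code page GetConsoleOutputCP() reports):

# Keep what we send to external utilities in sync with what the console
# expects to receive and display:
$OutputEncoding = [Console]::OutputEncoding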

@JamesWTruher (Member) commented Nov 7, 2017

With $OutputEncoding set in this way, the following scenario will not work:

get-content archive.tgz | gunzip | tar xvf -

@SteveL-MSFT (Member) commented Nov 7, 2017

@JamesWTruher understood; I think it's OK for 6.0.0, since that never worked correctly.

@daxian-dbw (Member) commented Nov 7, 2017

Closed via #5369.

daxian-dbw closed this Nov 7, 2017
