Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prepare for BOM-less UTF-8 default character encoding with respect to $OutputEncoding and console code page #4681

Closed
mklement0 opened this issue Aug 27, 2017 · 16 comments
Assignees
Labels
Resolution-Fixed The issue is fixed. WG-Engine core PowerShell engine, interpreter, and runtime
Milestone

Comments

@mklement0
Copy link
Contributor

mklement0 commented Aug 27, 2017

BOM-less UTF-8 character encoding is coming as the default for PowerShell Core on all platforms.

Two attendant changes are required:

  • Preference variable $OutputEncoding, which currently defaults to ASCII, must default to [System.Text.UTF8Encoding]::new() (UTF-8 with no BOM), or, perhaps preferably, not predefine this variable and default to that encoding (the internally used default) in its absence.

    • $OutputEncoding tells PowerShell what character encoding to use when sending output to external utilities.
  • Console / terminal character encoding:

    • On Windows, [Console]::InputEncoding and [Console]::OutputEncoding must both be set to [System.Text.UTF8Encoding]::new(), which is the equivalent of configuring a console window to use code page 65001 (UTF-8) or executing chcp 65001 before PowerShell is launched.

      • [Console]::OutputEncoding tells PowerShell what encoding to assume when reading output from external utilities.

      • On Windows, the Start Menu shortcut that is created during installation should be preconfigured to open a console window with code page 65001.

        • Conceivably, PowerShell should automatically switch to the 65001 code page in case it is launched from a console window with a different active code page (such as from cmd.exe), though it is worth noting that this change in encoding by default remains in effect until the window is closed (even after exiting PowerShell and returning to cmd.exe; perhaps a warning could be issued on startup).
    • On Unix platforms with UTF-8-based locales, which are the norm these days, no action is required.

      • To be determined: How should the rare event of being invoked from a terminal with a different active character encoding be handled? Changing the encoding on the fly, as on Windows, is not guaranteed to work. Perhaps a warning on startup is sufficient.

Before the above is implemented, the interim workaround to make a console window / terminal use UTF-8 consistently is the following command:

$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = [System.Text.UTF8Encoding]::new()

Environment data

PowerShell Core v6.0.0-beta.6
@mklement0 mklement0 changed the title Prepare for no-BOM UTF-8 default encoding with respect to $OutputEncoding and console code page Prepare for BOM-less UTF-8 default encoding with respect to $OutputEncoding and console code page Aug 27, 2017
@mklement0 mklement0 changed the title Prepare for BOM-less UTF-8 default encoding with respect to $OutputEncoding and console code page Prepare for BOM-less UTF-8 default character encoding with respect to $OutputEncoding and console code page Aug 27, 2017
@iSazonov
Copy link
Collaborator

It is already addressed in #4119

@iSazonov iSazonov added the WG-Engine core PowerShell engine, interpreter, and runtime label Aug 28, 2017
@mklement0
Copy link
Contributor Author

Thanks, @iSazonov, good to see how far the work has progressed.

However, as far as I can tell, #4119 addresses neither of the two issues raised here (which is why I called them attendant changes).

@iSazonov
Copy link
Collaborator

4119 was closed w/o merge - We are waiting new PR.

@SteveL-MSFT SteveL-MSFT added this to the 6.0.0-HighPriority milestone Sep 24, 2017
@muzzar78
Copy link

muzzar78 commented Oct 2, 2017

I noticed a difference between PS 5.1 and PS core 6 beta 7 using get-content and the -raw switch with a file that ended with a LF. In 5.1 the LF was read as part of the string and was the last character in my variable, where as in 6beta7 the LF was stripped. Is this by design? If I used "-encoding binary" then both 5.1 and 6 read the LF. This may cause some issues.

@SteveL-MSFT
Copy link
Member

raw should be raw including the EOL. Seems like a bug to me.

@mklement0
Copy link
Contributor Author

@SteveL-MSFT: Indeed; please see #4980

@JamesWTruher
Copy link
Member

I'm not sure that setting outputEncoding to utf8 w/o bom is correct, at least on some platforms. Here's an example, from my MacBook; I have a compressed tar archive, which I would love to unspool as:
gc a.tgz | gunzip | tar xvf -

setting $outputEncoding to utf8nobom doesn't do the trick:

PS> $outputEncoding = $utf8                                                                                                                 
PS> gc -raw f.tgz | gunzip | tar tfv -
gunzip: unknown compression format

It turns out there's a couple of problems;
first when we read the content, we break up the output into lines, but I can fix that with -raw, but that still doesn't work. However, if I read the file and set $outputEncoding with encoding set to iso-8859-1, viola it works (sort of)!

PS> $outputEncoding = $enc
PS> gc -raw -encoding $enc f.tgz | gunzip | tar tvf -                                                                                       
gunzip: (stdin): trailing garbage ignored
-rw-r--r--  0 james  wheel       2 Oct 31 12:39 1.txt
-rw-r--r--  0 james  wheel       2 Oct 31 12:36 2.txt
-rw-r--r--  0 james  wheel       2 Oct 31 12:36 3.txt
...

The last problem is that we seem to tack on [environment]::newline to whatever we push down the pipe (which is causing gunzip to complain about the trailing garbage).

@mklement0
Copy link
Contributor Author

mklement0 commented Nov 1, 2017

@JamesWTruher:

$OutputEncoding should only ever apply when sending text to external utilities.

The real problem here - a separate issue - is that PowerShell knows ONLY text; it lacks support for passing binary data through the pipeline.

Get-Content is designed for reading text files and -Raw makes matters worse in this case by reading the entire file at once.
(I always thought -Raw was an unfortunate name for what this switch does.)

While you might think that something like
Get-Content -AsByteStream f.tgz | gunzip | tar tfv -
would work, in reality PowerShell sends the textual, decimal representations of the byte values through the pipeline.

Contrast that with Unix utility cat, which simply copies raw bytes from stdin to stdout - it has no concept of encoding.

A "binary pipeline" has apparently partially been implemented - namely for direct external-utility-to-external-utility piping [update: seemingly not, as of v6.2.0_] - but AFAIK there's currently still no way to send binary data from a PowerShell command through the pipeline in raw binary format.

@iSazonov
Copy link
Collaborator

iSazonov commented Nov 1, 2017

Can we detect -AsByteStream and then consider "cmdlet | utility" as binary pipeline?

@JamesWTruher
Copy link
Member

@mklement0
a byte stream can be considered as a stream of characters. It is the dotnet coercion into an encoded string which is the issue, which is why my selection of iso-8859-1 works - it does not change the representation of bytes (in the case of unicode padded 0, or up to 5(!) bytes in the case of utf7)

At the base of all of this is that the method used to read the data from the process (in corefx) is rendered to a string, which means we need to find an encoding which does not alter the individual characters (as does utf8/unicode, etc) but pass them through unmolested as does iso-8859-1

The selection of iso-8859-1 as $OutputEncoding does what you are looking for, which is why my example as described works. I can get the contents of file in powershell and pipe it to native executables without issue - A binary pipeline

@mklement0
Copy link
Contributor Author

@iSazonov: I like the idea.

@mklement0
Copy link
Contributor Author

mklement0 commented Nov 2, 2017

@JamesWTruher

if iso-8859-1 works - it does not change the representation of bytes

Yes, it does change the representation, you just have to find the right characters:

> $outputEncoding = [System.Text.Encoding]::GetEncoding('iso-8859-1')
> '' | cat
?
> '' | grep ''  # no output

has no representation in iso-8859-1 (it is outside the 8-bit range in Unicode), so it gets transliterated to a literal ?.

Now let's try with UTF-8:

> $outputEncoding = [System.Text.UTF8Encoding]::new()
> '' | cat
€
> '' | grep ''

Voila: the UTF-16LE representation of was correctly translated into its UTF-8 representation.

That's why when it comes to text, UTF-8 is the right default value for $OutputEncoding.


When binary output is desired, by contrast, there is no reason to bring character encodings into the picture at all.

Get-Content -AsByteStream is the closest thing we have to bypassing as-string processing, but you cannot send this byte stream through the pipeline as-is, as stated.

@iSazonov's suggestion is promising, but I wonder if it goes far enough; there may be other cases where passing raw bytes through the pipeline from PowerShell is needed.

@stknohg
Copy link
Contributor

stknohg commented Nov 7, 2017

Hi.
We also need to think about non-English langauage, for example CJK.

# PowerShell 6.0 Beta.9 on CentOS 7.4
PS /> $outputEncoding = [System.Text.Encoding]::GetEncoding('iso-8859-1')
PS /> 'こんにちは世界' | cat
???????
PS /> $outputEncoding = [System.Text.UTF8Encoding]::new()
PS /> 'こんにちは世界' | cat
こんにちは世界

I think $OutputEncoding should be same as [Console]::OutputEncoding.
From CoreFx sources, [Console]::OutputEncoding is UTF-8 on Unix.
[Console]::OutputEncoding depends on GetConsoleOutputCP() function on Windows.

@JamesWTruher
Copy link
Member

with outputEncoding set in this way, the following scenario will not work
get-content archive.tgz | gunzip | tar xvf -

@SteveL-MSFT
Copy link
Member

@JamesWTruher understood, I think it's ok for 6.0.0 since that never worked correctly

@daxian-dbw
Copy link
Member

Close via #5369

@iSazonov iSazonov added the Resolution-Fixed The issue is fixed. label Nov 8, 2017
justinmk added a commit to neovim/neovim that referenced this issue Oct 1, 2022
Reverts #16271
Fixs #15913

Problem:
Since #16271, `make_filter_cmd` uses `Start-Process` cmdlet to execute the user
provided shell command for `:%!`. `Start-Process` requires the command to be
split into the shell command and its arguments. This was implemented in #19268
by parsing (splitting the user-provided command at the first space) which didn't
handle cases such as --
  - commands with escaped space in their filepath
  - quoted commands with space in their filepath

Solution: Use piping.

The total shell command formats (excluding noise of unimportant parameters):

1. Before #16271
    ```powershell
    pwsh -C "(shell_cmd) < tmp.in | 2>&1 Out-File -Encoding UTF8 <tmp.out>"
    # not how powershell commands work
    ```
2. Since #16271
    ```powershell
    pwsh -C "Start-Process shell_cmd -RedirectStandardInput <tmp.in> -RedirectStandardOutput <tmp.out>"
    # doesn't handle executable path with space in it
    # doesn't write error to <tmp.out>
    ```
3. This PR
    ```powershell
    pwsh -C "& { Get-Content <tmp.in> | & 'path\with space\to\shell_cmd.exe' arg1 arg2 } 2>&1 | Out-File -Encoding UTF8 <tmp.out>"
    # also works with forward slash in the filepath
    # also works with double quotes around shell command
    ```

After this PR, the user can use the following formats:

    :%!c:\Program` Files\Git\usr\bin\sort.exe
    :%!'c:\Program Files\Git\usr\bin\sort.exe'
    :%!"c:\Program Files\Git\usr\bin\sort.exe"
    :%!"c:\Program` Files\Git\usr\bin\sort.exe"

They can even chain different commands:

    :%!"c:\Program` Files\Git\usr\bin\sort.exe" | sort.exe -r

But if they want to call a stringed executable path, they have to provide the
Invoke-Command operator (&). In fact, the first stringed executable path also
needs this & operator, but this PR adds that behind the scene.

    :%!"c:\Program` Files\Git\usr\bin\sort.exe" | sort.exe -r | & 'c:\Program Files\Git\usr\bin\sort.exe'

## What this PR solves

- Having to parse the user-provided bang ex-command (for splitting into shell
  cmd and its args).
- Removes a lot of human-unreadable `#ifdef` blocks.
- Accepting escaped spaces in executable path.
- Accepting quoted string of executable path.
- Redirects error and exception to tmp.out (exception for when `wrong_cmd.exe
  not found`)

## What this PR doesn't solve

- Handling wrongly escaped path to executable, which the user may pass because
  of cmdline tab-completion. #18592

## Edge cases
- (Not handled) If the user themself provides the `&` sign (means `call
  this.exe` in powershell)
- (Not handled) Use `-Encoding utf8` parameter for `Get-Content`?
- (Handled) Doesn't write to tmp.out if shell command is not found.
    - fix: use anonymous function (`{wrong_cmd.exe}`).

## Changes other than `make_filter_cmd()` function

- Encoding for piping to external executables. See BOM-less UTF8:
  PowerShell/PowerShell#4681
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Resolution-Fixed The issue is fixed. WG-Engine core PowerShell engine, interpreter, and runtime
Projects
None yet
Development

No branches or pull requests

7 participants