Prepare for BOM-less UTF-8 default character encoding with respect to $OutputEncoding and console code page #4681
Comments
It is already addressed in #4119.

#4119 was closed without being merged; we are waiting for a new PR.

I noticed a difference between PS 5.1 and PS Core 6 beta 7 when using Get-Content with the -Raw switch on a file that ends with an LF. In 5.1 the LF was read as part of the string and was the last character in my variable, whereas in 6 beta 7 the LF was stripped. Is this by design? If I used "-encoding binary", then both 5.1 and 6 read the LF. This may cause some issues.

@SteveL-MSFT: Indeed; please see #4980

I'm not sure that setting outputEncoding to utf8 w/o bom is correct, at least on some platforms. Here's an example, from my MacBook; I have a compressed tar archive, which I would love to unspool as: setting

PS> $outputEncoding = $utf8
PS> gc -raw f.tgz | gunzip | tar tfv -
gunzip: unknown compression format

It turns out there's a couple of problems;

PS> $outputEncoding = $enc
PS> gc -raw -encoding $enc f.tgz | gunzip | tar tvf -
gunzip: (stdin): trailing garbage ignored
-rw-r--r-- 0 james wheel 2 Oct 31 12:39 1.txt
-rw-r--r-- 0 james wheel 2 Oct 31 12:36 2.txt
-rw-r--r-- 0 james wheel 2 Oct 31 12:36 3.txt
...

The last problem is that we seem to tack on [environment]::newline to whatever we push down the pipe (which is causing gunzip to complain about the trailing garbage).
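
The two symptoms in this comment (corrupted binary data and a trailing newline) can be reproduced outside PowerShell. The following is a minimal Python sketch, used here purely for illustration; the helper `through_text_pipeline` is hypothetical and only models the general "bytes to string to bytes" mechanism, not PowerShell's actual implementation.

```python
def through_text_pipeline(data: bytes, encoding: str, newline: str = "") -> bytes:
    """Model a text-only pipeline: decode bytes to a string, then re-encode.

    Hypothetical helper for illustration only; it is an assumption about the
    general mechanism, not a description of PowerShell internals.
    """
    text = data.decode(encoding, errors="replace")
    return (text + newline).encode(encoding, errors="replace")


# gzip magic number (0x1f 0x8b) followed by the "deflate" method byte.
gzip_header = b"\x1f\x8b\x08"

# 0x8b is not valid UTF-8 on its own, so it is replaced by U+FFFD.
via_utf8 = through_text_pipeline(gzip_header, "utf-8")

# ISO-8859-1 maps each byte 0x00-0xFF to exactly one code point,
# so arbitrary bytes survive the round trip.
via_latin1 = through_text_pipeline(gzip_header, "iso-8859-1")

# An appended newline arrives as one extra trailing byte.
via_latin1_newline = through_text_pipeline(gzip_header, "iso-8859-1", newline="\n")

print(via_utf8)
print(via_latin1)
print(via_latin1_newline)
```

Under these assumptions, the UTF-8 round trip destroys the gzip magic number, the ISO-8859-1 round trip preserves it, and the appended newline is the kind of extra byte a decompressor would report as trailing garbage.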

The real problem here - a separate issue - is that PowerShell knows ONLY text; it lacks support for passing binary data through the pipeline.
While you might think that something like Contrast that with Unix utility A "binary pipeline"

Can we detect

@mklement0 At the base of all of this is that the data read from the process (in corefx) is rendered to a string, which means we need to find an encoding which does not alter the individual characters (as utf8/unicode, etc. do) but passes them through unmolested, as iso-8859-1 does. The selection of iso-8859-1 as

@iSazonov: I like the idea.

Yes, it does change the representation, you just have to find the right characters:

> $outputEncoding = [System.Text.Encoding]::GetEncoding('iso-8859-1')
> '€' | cat
?
> '€' | grep '€' # no output

Now let's try with UTF-8:

> $outputEncoding = [System.Text.UTF8Encoding]::new()
> '€' | cat
€
> '€' | grep '€'
€

Voila: the UTF-16LE representation of That's why when it comes to text, UTF-8 is the right default value for When binary output is desired, by contrast, there is no reason to bring character encodings into the picture at all.
@iSazonov's suggestion is promising, but I wonder if it goes far enough; there may be other cases where passing raw bytes through the pipeline from PowerShell is needed.
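
The behavior shown with `'€' | cat` comes down to which bytes each encoding can produce for the character. A small Python sketch, for illustration only, makes the byte level visible; `errors="replace"` is an assumption chosen to mimic the `?` substitution seen in the transcript above.

```python
euro = "\u20ac"  # the euro sign, U+20AC

# ISO-8859-1 has no code point for the euro sign, so it must be substituted.
latin1_bytes = euro.encode("iso-8859-1", errors="replace")

# UTF-8 can represent every Unicode character; the euro sign takes three bytes.
utf8_bytes = euro.encode("utf-8")

print(latin1_bytes)
print(utf8_bytes)
print(utf8_bytes.decode("utf-8") == euro)
```

The substituted `?` is why a downstream `grep '€'` finds nothing with ISO-8859-1, while the UTF-8 bytes decode back to the original character.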

Hi.

# PowerShell 6.0 Beta.9 on CentOS 7.4
PS /> $outputEncoding = [System.Text.Encoding]::GetEncoding('iso-8859-1')
PS /> 'こんにちは世界' | cat
???????
PS /> $outputEncoding = [System.Text.UTF8Encoding]::new()
PS /> 'こんにちは世界' | cat
こんにちは世界

I think

With outputEncoding set in this way, the following scenario will not work

@JamesWTruher understood, I think it's OK for 6.0.0 since that never worked correctly.

Closed via #5369

Reverts #16271

Fixes #15913

Problem:

Since #16271, `make_filter_cmd` uses the `Start-Process` cmdlet to execute the user-provided shell command for `:%!`. `Start-Process` requires the command to be split into the shell command and its arguments. This was implemented in #19268 by parsing (splitting the user-provided command at the first space), which didn't handle cases such as:

- commands with an escaped space in their filepath
- quoted commands with a space in their filepath

Solution:

Use piping.

The total shell command formats (excluding noise of unimportant parameters):

1. Before #16271

```powershell
pwsh -C "(shell_cmd) < tmp.in | 2>&1 Out-File -Encoding UTF8 <tmp.out>"
# not how powershell commands work
```

2. Since #16271

```powershell
pwsh -C "Start-Process shell_cmd -RedirectStandardInput <tmp.in> -RedirectStandardOutput <tmp.out>"
# doesn't handle executable path with space in it
# doesn't write error to <tmp.out>
```

3. This PR

```powershell
pwsh -C "& { Get-Content <tmp.in> | & 'path\with space\to\shell_cmd.exe' arg1 arg2 } 2>&1 | Out-File -Encoding UTF8 <tmp.out>"
# also works with forward slash in the filepath
# also works with double quotes around shell command
```

After this PR, the user can use the following formats:

    :%!c:\Program` Files\Git\usr\bin\sort.exe
    :%!'c:\Program Files\Git\usr\bin\sort.exe'
    :%!"c:\Program Files\Git\usr\bin\sort.exe"
    :%!"c:\Program` Files\Git\usr\bin\sort.exe"

They can even chain different commands:

    :%!"c:\Program` Files\Git\usr\bin\sort.exe" | sort.exe -r

But if they want to call a stringed executable path, they have to provide the call operator (`&`). In fact, the first stringed executable path also needs this `&` operator, but this PR adds that behind the scenes.

    :%!"c:\Program` Files\Git\usr\bin\sort.exe" | sort.exe -r | & 'c:\Program Files\Git\usr\bin\sort.exe'

## What this PR solves

- Having to parse the user-provided bang ex-command (for splitting into shell cmd and its args).
- Removes a lot of human-unreadable `#ifdef` blocks.
- Accepting escaped spaces in executable path.
- Accepting quoted string of executable path.
- Redirects error and exception to tmp.out (exception for when `wrong_cmd.exe not found`)

## What this PR doesn't solve

- Handling wrongly escaped path to executable, which the user may pass because of cmdline tab-completion. #18592

## Edge cases

- (Not handled) If the user themself provides the `&` sign (means `call this.exe` in powershell)
- (Not handled) Use `-Encoding utf8` parameter for `Get-Content`?
- (Handled) Doesn't write to tmp.out if shell command is not found.
  - fix: use anonymous function (`{wrong_cmd.exe}`).

## Changes other than `make_filter_cmd()` function

- Encoding for piping to external executables. See BOM-less UTF8: PowerShell/PowerShell#4681
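
The parsing problem this PR description refers to (splitting the user-provided command at the first space) can be illustrated with a short Python sketch. This is not the editor's actual C code; `split_at_first_space` is a hypothetical helper that only mirrors the described strategy.

```python
def split_at_first_space(command: str) -> tuple:
    """Naive split into (executable, arguments) at the first space.

    Hypothetical illustration of the strategy described above,
    not the real implementation.
    """
    executable, _, arguments = command.partition(" ")
    return executable, arguments


# A quoted executable path that contains a space, followed by one argument.
command = r"'c:\Program Files\Git\usr\bin\sort.exe' -r"

executable, arguments = split_at_first_space(command)
print(executable)
print(arguments)
```

The split lands inside the quoted path, so neither part is usable on its own. Handing the whole string to the shell through a pipeline, as the PR does, avoids having to parse it at all.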
BOM-less UTF-8 character encoding is coming as the default for PowerShell Core on all platforms.
Two attendant changes are required:
- Preference variable `$OutputEncoding`, which currently defaults to ASCII, must default to `[System.Text.UTF8Encoding]::new()` (UTF-8 with no BOM); or, perhaps preferably, the variable should not be predefined, with that encoding (the internally used default) applying in its absence.
  - `$OutputEncoding` tells PowerShell what character encoding to use when sending output to external utilities.
- Console / terminal character encoding:
  - On Windows, `[Console]::InputEncoding` and `[Console]::OutputEncoding` must both be set to `[System.Text.UTF8Encoding]::new()`, which is the equivalent of configuring a console window to use code page `65001` (UTF-8) or executing `chcp 65001` before PowerShell is launched.
    - `[Console]::OutputEncoding` tells PowerShell what encoding to assume when reading output from external utilities.
    - On Windows, the Start Menu shortcut that is created during installation should be preconfigured to open a console window with code page `65001`.
    - PowerShell could also switch to the `65001` code page itself in case it is launched from a console window with a different active code page (such as from `cmd.exe`), though it is worth noting that this change in encoding by default remains in effect until the window is closed (even after exiting PowerShell and returning to `cmd.exe`; perhaps a warning could be issued on startup).
  - On Unix platforms with UTF-8-based locales, which are the norm these days, no action is required.
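
To make the distinction between ASCII, BOM-less UTF-8, and UTF-8 with a BOM concrete, here is a small Python sketch (Python is used only for illustration). `errors="replace"` is an assumption standing in for the lossy substitution that an ASCII encoder performs on non-ASCII characters; Python's `utf-8-sig` codec is used to produce the BOM variant.

```python
text = "caf\u00e9"  # "café": one non-ASCII character

# ASCII cannot represent U+00E9, so the character is lost.
ascii_bytes = text.encode("ascii", errors="replace")

# BOM-less UTF-8: just the encoded characters.
utf8_no_bom = text.encode("utf-8")

# UTF-8 with BOM: the same bytes prefixed with EF BB BF.
utf8_with_bom = text.encode("utf-8-sig")

print(ascii_bytes)
print(utf8_no_bom)
print(utf8_with_bom)
```

The three-byte prefix in the last result is what "BOM-less" avoids; external utilities that read raw bytes generally treat such a prefix as part of the data.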
Before the above is implemented, the interim workaround to make a console window / terminal use UTF-8 consistently is the following command:
Environment data