Description
This issue has two distinct aspects:
- discussion of an existing documentation bug
- discussion of the problematic fixed default file encoding currently (as of alpha 16) chosen for PowerShell Core.
Steps to reproduce
'ö' | Set-Content -NoNewline -Encoding ASCII tmp.txt
'ö' | Add-Content -Encoding ASCII -NoNewline tmp.txt
Get-Content -Encoding ASCII tmp.txt
(Get-Content -Encoding Byte -TotalCount 2 tmp.txt) | % { '0x{0:x}' -f $_ }
'--'
'ö' | Set-Content -NoNewline tmp.txt # use default encoding
'ö' | Add-Content -NoNewline tmp.txt # use default encoding
Get-Content tmp.txt # use default encoding
(Get-Content -Encoding Byte -TotalCount 2 tmp.txt) | % { '0x{0:x}' -f $_ }
Expected behavior
??
0x3f
0x3f
--
??
0x3f
0x3f
Actual behavior
??
0x3f
0x3f
--
öö
0xf6
0xf6
That is, ASCII encoding turns a non-ASCII character into a literal ? (0x3f).
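The same substitution can be observed without any file I/O. The following is a minimal sketch that calls the .NET System.Text.Encoding API directly; it is an illustration only, not a claim about how the cmdlets are implemented, and the variable names are made up for this example:

```powershell
# Build the test character from its code point so the result does not
# depend on how this script file itself is encoded.
$char = [string][char]0xF6   # 'ö' (U+00F6)

# ASCII covers only U+0000..U+007F; .NET's ASCII encoder substitutes '?'
# for any character outside that range.
$asciiBytes = [System.Text.Encoding]::ASCII.GetBytes($char)

$asciiBytes | ForEach-Object { '0x{0:x}' -f $_ }   # 0x3f
```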
The fact that Set-Content without an -Encoding argument resulted in ö on reading implies that ASCII encoding wasn't used, and the specific byte value of 0xf6 further implies that a single-byte, extended-ASCII encoding was used:
- For Windows PowerShell, it is the respective system's legacy codepage ("ANSI"), such as Windows-1252 on US-English systems, or Windows-1251 on Russian systems. In other words: the specific encoding is, to put it in Unix terms, locale-dependent.
- For PowerShell Core, as of alpha 16, it is ISO-8859-1, as @iSazonov helpfully points out (see his comment below for the source-code links).
  - Using a fixed encoding that is limited to 256 code points is problematic, however.
  - See @iSazonov's comment below and the discussion of the RFC about default file encodings.
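To illustrate why a fixed encoding limited to 256 code points is a problem, here is a hedged sketch that again uses the .NET encoding API directly (an assumption made for illustration; the cmdlets' internal code path may differ). Code page 28591 is .NET's identifier for ISO-8859-1, and the euro sign is just an arbitrary example of a character outside that range:

```powershell
$latin1 = [System.Text.Encoding]::GetEncoding(28591)   # ISO-8859-1

$oUmlaut = [string][char]0xF6     # 'ö' (U+00F6), inside ISO-8859-1
$euro    = [string][char]0x20AC   # '€' (U+20AC), outside ISO-8859-1

# 'ö' maps to the single byte whose value equals its code point.
$latin1.GetBytes($oUmlaut) | ForEach-Object { '0x{0:x}' -f $_ }   # 0xf6

# '€' has no ISO-8859-1 representation, so it cannot survive a round trip
# through this encoding.
$roundTripped = $latin1.GetString($latin1.GetBytes($euro))
$roundTripped -ceq $euro   # False
```

Any character above U+00FF is lost in the same way, whereas a Unicode encoding such as UTF-8 would preserve it.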
In contrast, Get-Help Set-Content, Get-Help Add-Content, and Get-Help Get-Content state for parameter -Encoding:
Specifies the file encoding. The default is ASCII.
The help-topic sources (branch live) for the relevant cmdlets can be found here.
Additionally:
- While these cmdlets accept an encoding identifier Default, as used in other cmdlets, the help only mentions String.
  - Given that the two appear to result in the same encoding, what is their relationship?
- The description for encoding String in the online help is inadequate:
Uses the encoding type for a string.
Environment data
PowerShell Core v6.0.0-alpha (v6.0.0-alpha.16) on Microsoft Windows 10 Pro (64-bit; v10.0.14393)