Need "ANSI" encoding enumeration value to support "ANSI"-code-page-encoded files (e.g., Windows 1252) #6562

Closed
mklement0 opened this issue Apr 4, 2018 · 12 comments · Fixed by #19298
Labels
Issue-Discussion the issue may not have a clear classification yet. The issue may generate an RFC or may be reclassif Resolution-Fixed The issue is fixed.

Comments

@mklement0
Contributor

mklement0 commented Apr 4, 2018

As discussed in #6550:

While you can pass OEM to filesystem cmdlets to support the legacy system locale's OEM code page on Windows, its "ANSI" counterpart (such as Windows 1252 on US-English systems) is currently missing.
(In Windows PowerShell, the Default value fulfills that role, but in PowerShell Core Default now refers to the new default, (BOM-less) UTF-8.)

Therefore, an ANSI encoding value should be introduced to complement the OEM value.

With ANSI available, the current workaround:

Get-Content -Encoding ([cultureinfo]::CurrentCulture.TextInfo.ANSICodePage) file.txt

would simply become:

Get-Content -Encoding Ansi file.txt   # Wishful thinking.

Note: Given that OEM already is available even when running on Unix-like platforms, it sounds like we shouldn't restrict ANSI's availability to Windows. ([System.Text.Encoding]::GetEncoding([cultureinfo]::CurrentCulture.TextInfo.ANSICodePage) seemingly does return a locale-appropriate value on Unix-like platforms as well.)
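To make the workaround above easier to follow, here is a minimal sketch that splits it into its two steps: looking up the ANSI code page number of the current culture, then resolving that number to a System.Text.Encoding instance. The variable names are illustrative only, and the sketch assumes the code page in question is available in the running session.

```powershell
# Look up the ANSI code page number associated with the current culture
# (for instance, 1252 for en-US).
$ansiCodePage = [cultureinfo]::CurrentCulture.TextInfo.ANSICodePage

# Resolve the number to a System.Text.Encoding instance.
$ansiEncoding = [System.Text.Encoding]::GetEncoding($ansiCodePage)

# Show which encoding was resolved.
'{0} (code page {1})' -f $ansiEncoding.WebName, $ansiEncoding.CodePage
```

The resulting $ansiEncoding object can also be passed directly to .NET APIs such as [System.IO.File]::ReadAllLines().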

Environment data

Written as of:

PowerShell Core v6.0.2
@mklement0 mklement0 changed the title Need "ANSI" encoding enumeration value to support system-locale "ANSI"-code-page-encoded files (e.g., Windows 1252) Need "ANSI" encoding enumeration value to support "ANSI"-code-page-encoded files (e.g., Windows 1252) Apr 4, 2018
@iSazonov
Collaborator

iSazonov commented Apr 5, 2018

We don't use numbers with OEM, so it seems we should likewise use only ANSI, without a number, as an alias of CurrentCulture.TextInfo.ANSICodePage.

@iSazonov iSazonov added the Issue-Discussion the issue may not have a clear classification yet. The issue may generate an RFC or may be reclassif label Apr 5, 2018
@mklement0
Contributor Author

@iSazonov: Agreed.

@chuanjiao10: The purpose of this issue is to restore accidentally removed functionality to PS Core: support for the active ANSI code page.

What you're proposing is an enhancement (as an aside: something like -Encoding Ansi 936 wouldn't work for syntax reasons), so I suggest you open a new issue.
My syntax proposal for such an enhancement would be to allow numerical values as the -Encoding argument to directly represent the code pages by their numbers; e.g., -Encoding 936 for the ANSI/OEM Simplified Chinese (PRC, Singapore); Chinese Simplified (GB2312) code page.

@iSazonov
Collaborator

iSazonov commented Apr 5, 2018

My syntax proposal for such an enhancement would be to allow numerical values

Please open a new issue. We should discuss this (and also why [System.Text.Encoding]::GetEncodings() returns only a short list although we load System.Text.Encoding.Pages.dll).

@mklement0
Contributor Author

@iSazonov: Yes, it should be a new issue, but I was suggesting that @chuanjiao10 create it (I only suggested a possible syntax).

Interesting about the short list of Unix - hadn't noticed that - perhaps yet another issue.

@mklement0
Contributor Author

@iSazonov: Just as a quick pointer regarding the "short list":

[System.Text.Encoding]::GetEncodings() only ever reflects the encodings available by default in .NET Core, even if additional ones were registered via [System.Text.Encoding]::RegisterProvider() later; sadly, [System.Text.CodePagesEncodingProvider]::Instance has NO equivalent method for enumerating the encodings it implements.
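A minimal sketch of how one might observe this discrepancy: it checks whether a code page shows up in the GetEncodings() list and, separately, whether GetEncoding() can actually resolve it. Code page 932 (shift_jis) is only an example, and the results depend on the PowerShell / .NET version in use.

```powershell
$codePage = 932   # shift_jis; chosen purely as an example

# Is the code page part of the enumerated list?
$isListed = [System.Text.Encoding]::GetEncodings().CodePage -contains $codePage

# Can the code page be resolved regardless of whether it is listed?
try {
  $null = [System.Text.Encoding]::GetEncoding($codePage)
  $isAvailable = $true
} catch {
  $isAvailable = $false
}

[pscustomobject] @{ CodePage = $codePage; Listed = $isListed; Available = $isAvailable }
```

If Listed is False while Available is True, the encoding was contributed by a registered provider that the enumeration does not reflect.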

@iSazonov
Collaborator

iSazonov commented Apr 6, 2018

New issue for [System.Text.Encoding]::GetEncodings() discussion #6580.

@iSazonov
Collaborator

iSazonov commented Apr 6, 2018

New issue for "to allow numerical values" discussion #6581

@mklement0
Contributor Author

I appreciate it, @iSazonov.

@jongross4

jongross4 commented Aug 13, 2018

I would like to echo the concerns above. The current list of encodings is too limiting. I am dealing with text encoded as SHIFT_JIS (cp932) on an OEM-US (cp437) system and need to get the text to Unicode. I am currently working around this with a Get-EncodedContent function that accepts any of the code pages returned by [system.text.encoding]::GetEncodings() and then reads the file with [system.io.file]::ReadAllLines($Path,$Encoding). Even using encoding RAW would destroy the SHIFT_JIS text on my system.

This is a work in progress but should help others work around the issue in the meantime:

function Get-EncodedContent {
    [CmdletBinding()]
    param (
        $Path
    )

    DynamicParam {
        # Define an optional -CodePage parameter whose valid values are the
        # code pages reported by [System.Text.Encoding]::GetEncodings().
        $ParamName = 'CodePage'
        $attributes = New-Object System.Management.Automation.ParameterAttribute
        $attributes.ParameterSetName = '__AllParameterSets'
        $attributes.Mandatory = $false
        $attributeCollection = New-Object -Type System.Collections.ObjectModel.Collection[System.Attribute]
        $attributeCollection.Add($attributes)
        $_Values = ([System.Text.Encoding]::GetEncodings()).CodePage
        $ValidateSet = New-Object System.Management.Automation.ValidateSetAttribute($_Values)
        $attributeCollection.Add($ValidateSet)
        $dynParam1 = New-Object -Type System.Management.Automation.RuntimeDefinedParameter($ParamName, [String], $attributeCollection)
        $paramDictionary = New-Object -Type System.Management.Automation.RuntimeDefinedParameterDictionary
        $paramDictionary.Add($ParamName, $dynParam1)
        return $paramDictionary
    }

    begin {
        # Translate the chosen code page number into an encoding object.
        $CodePage = [int]($PSBoundParameters.CodePage)
        $TextEncoding = [System.Text.Encoding]::GetEncoding($CodePage)
    }

    process {
        # Convert-Path yields a full file-system path string, which the .NET method requires.
        [System.IO.File]::ReadAllLines((Convert-Path $Path), $TextEncoding)
    }
}
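For the underlying goal of getting code-page-encoded text "to Unicode", a transcoding helper could look like the following minimal sketch. The function name Convert-FileToUtf8 and its parameters are hypothetical, not part of PowerShell; the sketch assumes the requested code page is available in the session and that writing BOM-less UTF-8 is the desired outcome.

```powershell
# Hypothetical helper: re-encode a file from a given code page to BOM-less UTF-8.
function Convert-FileToUtf8 {
  param(
    [Parameter(Mandatory)] [string] $Path,
    [Parameter(Mandatory)] [string] $Destination,
    [Parameter(Mandatory)] [int] $CodePage
  )

  $sourceEncoding = [System.Text.Encoding]::GetEncoding($CodePage)

  # .NET methods don't honor PowerShell's current location, so resolve full paths first.
  $fullSource = Convert-Path -LiteralPath $Path
  $fullDestination = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Destination)

  # Decode with the source encoding, then write the text back out as UTF-8 without a BOM.
  $text = [System.IO.File]::ReadAllText($fullSource, $sourceEncoding)
  [System.IO.File]::WriteAllText($fullDestination, $text, [System.Text.UTF8Encoding]::new($false))
}
```

For example, Convert-FileToUtf8 -Path in.txt -Destination out.txt -CodePage 932 would transcode a SHIFT_JIS file. Like ReadAllLines(), this reads the whole file into memory rather than streaming it.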

@sba923
Contributor

sba923 commented Jul 18, 2019

Thanks for the snippet. That's better than my current way to work around this issue in my scripts:

    $iswinps = ($null, 'Desktop') -contains $PSVersionTable.PSEdition
    if (!$iswinps)
    {
        $encoding = [System.Text.Encoding]::GetEncoding(1252)
    }
    else
    {
        $encoding = [Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding]::Default
    }
    
    Get-Content -Encoding $encoding ...

@mklement0
Contributor Author

mklement0 commented Jan 21, 2020

Let me summarize the status quo as of PowerShell Core 7.0.0-rc.2:

  • Ansi as an -Encoding argument is still not supported (which would only be relevant on Windows, where it should refer to whatever the active ANSI code page happens to be, to match the default encoding applied by Windows PowerShell).

  • However, passing specific code-page numbers (e.g., -Encoding 930) or encoding names (e.g., -Encoding shift_jis) is now supported - no more workarounds needed per se - unless you want tab-completion.

    • You should be able to find the list of supported encoding names / code pages with [Text.Encoding]::GetEncodings().Name / [Text.Encoding]::GetEncodings().CodePage but on PS Core you can't, due to lack of CoreFX API support - see https://github.com/dotnet/corefx/issues/28944
    • Even though not all encodings are listed, they still work, however. You can see the current list in the function below.
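As a quick illustration of the second point, the following sketch writes a small Windows-1252-encoded sample file and reads it back, once by code-page number and once by encoding name. It assumes a PowerShell (Core) version in which -Encoding accepts such values, as described above; the file name and sample text are made up for the example.

```powershell
# Sample text containing a non-ASCII character (é), built without relying on the script file's own encoding.
$sample = "caf$([char] 0xE9)"
$path = Join-Path ([IO.Path]::GetTempPath()) 'encoding-sample.txt'

# Write the sample as Windows-1252 via .NET.
[IO.File]::WriteAllText($path, $sample, [Text.Encoding]::GetEncoding(1252))

# Read it back by code-page number and by encoding name.
$byNumber = Get-Content -Encoding 1252 -LiteralPath $path
$byName = Get-Content -Encoding windows-1252 -LiteralPath $path

# Both comparisons should succeed if the file was decoded correctly.
$byNumber -eq $sample
$byName -eq $sample

Remove-Item -LiteralPath $path
```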

Tab-completion would be nice, however; here's a proof-of-concept function adapted from @jongross4's workaround; it supports both code-page numbers and encoding names for tab completion, along with PowerShell's own identifiers if you type Get-EncodedContent -Encoding <tab>

function Get-EncodedContent {

  [CmdletBinding()]
  param (
    $Path
  )

  DynamicParam {
    $paramName = 'Encoding'
    $codePageNums = [Text.Encoding]::GetEncodings().CodePage
    $encodingNames = [Text.Encoding]::GetEncodings().Name
    # PowerShell's valid -Encoding arguments - sans 'Unknown' and 'String'
    $psEncodingNames = 'Unicode', 'Byte', 'BigEndianUnicode', 'UTF8', 'UTF7', 'UTF32', 'Ascii', 'Default', 'Oem', 'BigEndianUTF32'
    if ($codePageNums -notcontains 1252) {
      # Workaround for PS Core as of v7: only the .NET Core default set is listed, not also those added later by PowerShell - see https://github.com/dotnet/corefx/issues/28944
      # We use hard-coded lists obtained via Windows PowerShell:
      #     ([Text.Encoding]::GetEncodings().CodePage) -join ', '
      #     "'{0}'" -f (([Text.Encoding]::GetEncodings().Name) -join "', '")
      $codePageNums = 37, 437, 500, 708, 720, 737, 775, 850, 852, 855, 857, 858, 860, 861, 862, 863, 864, 865, 866, 869, 870, 874, 875, 932, 936, 949, 950, 1026, 1047, 1140, 1141, 1142, 1143, 1144, 1145, 1146, 1147, 1148, 1149, 1200, 1201, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258, 1361, 10000, 10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008, 10010, 10017, 10021, 10029, 10079, 10081, 10082, 12000, 12001, 20000, 20001, 20002, 20003, 20004, 20005, 20105, 20106, 20107, 20108, 20127, 20261, 20269, 20273, 20277, 20278, 20280, 20284, 20285, 20290, 20297, 20420, 20423, 20424, 20833, 20838, 20866, 20871, 20880, 20905, 20924, 20932, 20936, 20949, 21025, 21866, 28591, 28592, 28593, 28594, 28595, 28596, 28597, 28598, 28599, 28603, 28605, 29001, 38598, 50220, 50221, 50222, 50225, 50227, 51932, 51936, 51949, 52936, 54936, 57002, 57003, 57004, 57005, 57006, 57007, 57008, 57009, 57010, 57011, 65000, 65001
      $encodingNames = 'IBM037', 'IBM437', 'IBM500', 'ASMO-708', 'DOS-720', 'ibm737', 'ibm775', 'ibm850', 'ibm852', 'IBM855', 'ibm857', 'IBM00858', 'IBM860', 'ibm861', 'DOS-862', 'IBM863', 'IBM864', 'IBM865', 'cp866', 'ibm869', 'IBM870', 'windows-874', 'cp875', 'shift_jis', 'gb2312', 'ks_c_5601-1987', 'big5', 'IBM1026', 'IBM01047', 'IBM01140', 'IBM01141', 'IBM01142', 'IBM01143', 'IBM01144', 'IBM01145', 'IBM01146', 'IBM01147', 'IBM01148', 'IBM01149', 'utf-16', 'utf-16BE', 'windows-1250', 'windows-1251', 'Windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'Johab', 'macintosh', 'x-mac-japanese', 'x-mac-chinesetrad', 'x-mac-korean', 'x-mac-arabic', 'x-mac-hebrew', 'x-mac-greek', 'x-mac-cyrillic', 'x-mac-chinesesimp', 'x-mac-romanian', 'x-mac-ukrainian', 'x-mac-thai', 'x-mac-ce', 'x-mac-icelandic', 'x-mac-turkish', 'x-mac-croatian', 'utf-32', 'utf-32BE', 'x-Chinese-CNS', 'x-cp20001', 'x-Chinese-Eten', 'x-cp20003', 'x-cp20004', 'x-cp20005', 'x-IA5', 'x-IA5-German', 'x-IA5-Swedish', 'x-IA5-Norwegian', 'us-ascii', 'x-cp20261', 'x-cp20269', 'IBM273', 'IBM277', 'IBM278', 'IBM280', 'IBM284', 'IBM285', 'IBM290', 'IBM297', 'IBM420', 'IBM423', 'IBM424', 'x-EBCDIC-KoreanExtended', 'IBM-Thai', 'koi8-r', 'IBM871', 'IBM880', 'IBM905', 'IBM00924', 'EUC-JP', 'x-cp20936', 'x-cp20949', 'cp1025', 'koi8-u', 'iso-8859-1', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5', 'iso-8859-6', 'iso-8859-7', 'iso-8859-8', 'iso-8859-9', 'iso-8859-13', 'iso-8859-15', 'x-Europa', 'iso-8859-8-i', 'iso-2022-jp', 'csISO2022JP', 'iso-2022-jp', 'iso-2022-kr', 'x-cp50227', 'euc-jp', 'EUC-CN', 'euc-kr', 'hz-gb-2312', 'GB18030', 'x-iscii-de', 'x-iscii-be', 'x-iscii-ta', 'x-iscii-te', 'x-iscii-as', 'x-iscii-or', 'x-iscii-ka', 'x-iscii-ma', 'x-iscii-gu', 'x-iscii-pa', 'utf-7', 'utf-8'
    }
    $validateSet = [Management.Automation.ValidateSetAttribute]::new([string[]] ($codePageNums + $encodingNames + $psEncodingNames))
    $dynParam = [Management.Automation.RuntimeDefinedParameter]::new(
      $paramName, 
      [string], 
      ([Management.Automation.ParameterAttribute] @{ ParameterSetName = '__AllParameterSets' }, $validateSet)
    )
    ($paramDictionary = [Management.Automation.RuntimeDefinedParameterDictionary]::new()).Add($paramName, $dynParam)
    return $paramDictionary
  }
  
  end {

    Set-StrictMode -Version 1

    if (($encoding = $PSBoundParameters.Encoding)) { # -Encoding specified.

      $isPSCore = $PSVersionTable.PSEdition -eq 'Core'
      $isPsIdentifier = $false
      if ($encoding -as [int]) { # code page
        # If a code-page number was given, make it an [int].
        $encoding = [int] $encoding 
      } else { # name
        # See if the identifier is a standard PS encoding identifier.
        $isPsIdentifier = 'Unicode', 'Byte', 'BigEndianUnicode', 'UTF8', 'UTF7', 'UTF32', 'Ascii', 'Default', 'Oem', 'BigEndianUTF32' -contains $encoding
      }

      # In PS Core we can always pass the -Encoding argument through,
      # in Win PS only if it is a standard identifier.
      if ($isPSCore -or $isPsIdentifier) { 

        # Workaround for PS Core as of v7.0 for 'BigEndianUTF32' not being supported - see https://github.com/PowerShell/PowerShell/issues/11645
        # Translate to the equivalent System.Text.Encoding name.
        if ($isPSCore -and $encoding -eq 'BigEndianUTF32') { $encoding = 'UTF-32BE' }

        Get-Content $Path -Encoding $encoding

      }
      else { # WinPS - obtain a System.Text.Encoding instance and use [IO.File]::ReadAllLines()

        # Caveat: This doesn't *stream* through the pipeline - it reads all lines *up front*
        [IO.File]::ReadAllLines((Convert-Path $Path), [Text.Encoding]::GetEncoding($encoding))

      }

    }
    else { # -Encoding not specified -> simply invoke Get-Content
      Get-Content -Path $path
    }

  }

}

@ghost

ghost commented Mar 14, 2023

🎉 This issue was addressed in #19298, which has now been successfully released as v7.4.0-preview.2. 🎉
