Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change $OutputEncoding to use iso-8859-1 encoding rather than ASCII #5361

Conversation

JamesWTruher
Copy link
Member

This enables a number of useful scenarios.
With this change it is now possible to do the following:

get-content -raw -encoding iso8859 /tmp/archive.tgz | gunzip | tar xvf -

and expand a compressed tar archive.
It also ensures that the bytes/characters passed to a native executable are not changed by the pipeline.
What is passed into the pipeline is passed to the next native application

This does not address #1908, as that requires a larger re-architecture of the pipeline. This PR continues to have a [environment]::newline which is added to the native output which should be addressed when 1908 is fixed.

This is marked as a breaking change because the output of the pipeline is not altered as it was previously, anyone relying on ascii encoding (especially for extended ascii sequences) will no longer have that behavior (broken though it was).

This enables a number of useful scenarios.
With this change it is now possible to do the following:

```powershell
get-content -raw -encoding iso8859 /tmp/archive.tgz | gunzip | tar xvf -
```
and expand a compressed tar archive.
It also ensures that the bytes/characters passed to a native executable are not changed by the pipeline.
What is passed to the pipeline is passed to the next native application.
Copy link
Member

@daxian-dbw daxian-dbw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JamesWTruher Per our discussion, you are going to add newlines when reading the redirected output from a native command. I don't see that change in this PR. Will that be a separate PR?

@@ -1797,8 +1797,7 @@ internal void Start(Process process, NativeCommandIOFormat inputFormat)
//Get the encoding for writing to native command. Note we get the Encoding
//from the current scope so a script or function can use a different encoding
//than global value.
Encoding pipeEncoding = _command.Context.GetVariableValue(SpecialVariables.OutputEncodingVarPath) as System.Text.Encoding ??
Encoding.ASCII;
Encoding pipeEncoding = _command.Context.GetVariableValue(SpecialVariables.OutputEncodingVarPath) as System.Text.Encoding ?? Utils.iso8859;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment: there is an extra space before Utils.iso8859.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!

@@ -4410,7 +4410,7 @@ .ForwardHelpCategory Cmdlet
// Variable which controls the encoding for piping data to a NativeCommand
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if you can add more comments about the characteristics of this encoding and mention that's why we use it as the default output encoding.

encodingToUse = Encoding.GetEncoding(28591);
}
Console.InputEncoding = encodingToUse;
string pipedText = Console.In.ReadToEnd();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should just spit out the characters read in without having the encoding involved? For example, like this:

int ch;
while ((ch = Console.In.Read()) != -1)
{
    Console.WriteLine(ch);
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that much better, thanks
updated

@daxian-dbw daxian-dbw added the Documentation Needed in this repo Documentation is needed in this repo label Nov 7, 2017
new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
internal static readonly UTF8Encoding utf8NoBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
// encoding which is essentially a passthru
internal static readonly Encoding iso8859 = Encoding.GetEncoding(28591);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should use the overload that takes a string instead:

GetEncoding("iso-8859-1");

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

{
// write a string of characters (bytes), this closest resembles a binary
// write the string "test" with an accent over the e
byte[] testbytes = new byte[] { 116, 233, 115, 116 };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just write all the bytes from 0 to 255?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need
[char]233 is all that is needed to show the problem, and is readable from the command line if you run it.

}
catch
{
;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like you should fail this and return non-zero exit code if invalid encoding is provided. Also suggest using the string overload version instead of the int making the code easier to read.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've replaced this with something far simpler (see dongbo's suggestion above)

}
if ( encodingToUse == null )
{
encodingToUse = Encoding.GetEncoding(28591);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use GetEncoding("iso-8859-1")

@mklement0
Copy link
Contributor

This PR is fundamentally misguided, as discussed here.

@SteveL-MSFT
Copy link
Member

Based on the discussion @mklement0 linked, it seems that for console text output, we should default to UTF-8. Binary pipeline is something we'll have to defer to later.

@mklement0
Copy link
Contributor

@SteveL-MSFT: That's a great summary - thank you.

@SteveL-MSFT SteveL-MSFT added this to the 6.0.0-RC milestone Nov 7, 2017
@JamesWTruher
Copy link
Member Author

closing in favor of a new PR which simply changes the output encoding to utf8nobom.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Breaking-Change breaking change that may affect users
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants