Make newly created PowerShell files default to UTF-8 *with BOM* to avoid encoding misinterpretation #1771

Closed
mklement0 opened this issue Feb 21, 2019 · 19 comments
Labels
Area-Configuration Feature: VS Code Request to use or implement a VS Code feature. Issue-Enhancement A feature request (enhancement). Resolution-Answered Will close automatically.

Comments

@mklement0
Contributor

mklement0 commented Feb 21, 2019

Summary of the new feature

  • VSCode creates UTF-8 files without BOM by default.
  • This causes Windows PowerShell (but not PowerShell Core) to misinterpret any non-ASCII-range characters, because, in the absence of a BOM, it defaults to the system's legacy "ANSI" code page (e.g., Windows-1252).
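The misinterpretation described in the bullets above can be reproduced outside PowerShell. The following is a minimal sketch, not Windows PowerShell's actual implementation: `decode_like_legacy_reader` is a hypothetical helper that honors a UTF-8 BOM if one is present and otherwise falls back to an assumed ANSI code page (Windows-1252 here).

```python
import codecs

text = "Write-Output 'café'"

# What VS Code writes by default: UTF-8 without a BOM.
no_bom = text.encode("utf-8")
# The same text saved as "UTF-8 with BOM".
with_bom = codecs.BOM_UTF8 + no_bom


def decode_like_legacy_reader(data: bytes, ansi_codepage: str = "cp1252") -> str:
    """Hypothetical model: honor a UTF-8 BOM if present, else use the ANSI code page."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    return data.decode(ansi_codepage)


print(decode_like_legacy_reader(no_bom))    # Write-Output 'cafÃ©'
print(decode_like_legacy_reader(with_bom))  # Write-Output 'café'
```

The two UTF-8 bytes of "é" (0xC3 0xA9) are read as two separate Windows-1252 characters, which is why only non-ASCII characters are affected.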

It seems that this has been a perennial pain point (whose root cause isn't obvious), as evidenced, for instance, by #629 or this Stack Overflow question.

Making the extension default all new PowerShell files to UTF-8 with BOM solves that problem.

Such files would be both cross-edition and cross-platform compatible (given that PowerShell Core still correctly interprets the BOM, even though it doesn't require it).

Proposed technical implementation details

I don't know how it works in the context of extension-specific settings, but in the general settings.json file you can simply add the following:

"[powershell]": {
  "files.encoding": "utf8bom"
}
@rjmholt
Collaborator

rjmholt commented Feb 21, 2019

The current big issue here is that the extension is third-party software in the eyes of both VSCode and PowerShell.

There's no exposed API for it to configure this VSCode setting on installation.

See microsoft/vscode#824.

Anyone seeing issues here, please lend your 👍 to microsoft/vscode#824 and take a look at MicrosoftDocs/PowerShell-Docs#3743 (will update to the doc link when it's merged).

@mklement0 I assume that document prompted this issue?

@mklement0
Contributor Author

Thanks, @rjmholt - it's unfortunate that there's still no API for this (I've since given the linked issue a thumbs-up).

I assume that document prompted this issue?

No, I wasn't aware of that document (thanks for the link). It was my own experience and seeing people run into the problem on SO (Stack Overflow) that prompted me to create this issue.

I've now (hopefully) given it wider exposure with this SO answer.

@rjmholt
Collaborator

rjmholt commented Feb 21, 2019

I've now (hopefully) given it wider exposure with this SO answer.

Ah! I'll link to it in the new doc

@rjmholt rjmholt added Issue-Enhancement A feature request (enhancement). Feature: VS Code Request to use or implement a VS Code feature. Area-General labels Feb 21, 2019
@rkeithhill
Collaborator

With the new doc on handling encoding WRT PowerShell and text editors, can we close this?

@mklement0
Contributor Author

I suggest keeping this open with a Resolution-External label (or equivalent) to await being able to implement a proper solution via the future API proposed in microsoft/vscode#824.

@rjmholt
Collaborator

rjmholt commented Mar 13, 2019

Doc: https://docs.microsoft.com/en-us/powershell/scripting/components/vscode/understanding-file-encoding?view=powershell-6

@irvnriir

irvnriir commented Mar 22, 2022

Windows PowerShell just can't pull off a single change to its default. Better that users create files with a BOM for no reason...

@andyleejordan andyleejordan added the Needs: Maintainer Attention Maintainer attention needed! label Mar 22, 2022
@andyleejordan
Member

@mklement0 nowadays it is possible for the extension to supply a default configuration for powershell files:

"[powershell]": {
    "files.encoding": "utf8bom",
    "files.autoGuessEncoding": true
}

This could go here:

"configurationDefaults": {
  "[powershell]": {
    "debug.saveBeforeStart": "nonUntitledEditorsInActiveGroup",
    "editor.semanticHighlighting.enabled": false,
    "editor.wordSeparators": "`~!@#$%^&*()=+[{]}\\|;:'\",.<>/?"
  }
}

I can't think of anything it would break...as you pointed out, PowerShell Core readily accepts UTF8BOM, and it fixes issues with Windows PowerShell. @rjmholt can you think of any reasons not to do this now?

@andyleejordan andyleejordan added Needs: Author Feedback Please give us the requested feedback! and removed Needs: Maintainer Attention Maintainer attention needed! labels Mar 22, 2022
@jborean93

I can't think of anything it would break

It will break things for people relying on shebangs on Linux if this change were to happen. Shebangs rely on the first 2 bytes of the file being 0x23 0x21, and the BOM changes that, so a file with a BOM will break that setup.
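To make the byte-level point concrete, here is a small sketch. `starts_with_shebang` is a hypothetical helper that only models the two-byte check described above; it is not the kernel's actual code, and the `/usr/bin/env pwsh` interpreter line is an illustrative assumption.

```python
import codecs

SHEBANG = b"#!"  # the two bytes 0x23 0x21


def starts_with_shebang(data: bytes) -> bool:
    """Simplified model of the check: the first two bytes must be 0x23 0x21."""
    return data[:2] == SHEBANG


script = "#!/usr/bin/env pwsh\nWrite-Output 'hi'\n".encode("utf-8")
script_with_bom = codecs.BOM_UTF8 + script

print(script[:2].hex(" "))                    # 23 21
print(script_with_bom[:3].hex(" "))           # ef bb bf
print(starts_with_shebang(script))            # True
print(starts_with_shebang(script_with_bom))   # False
```

With the BOM in place, the file begins with 0xEF 0xBB 0xBF, so the `#!` bytes are pushed to offsets 3 and 4 and the check no longer matches.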

@mklement0
Contributor Author

mklement0 commented Mar 22, 2022

Thanks, @andschwa - this manual configuration option is already a part of the OP; the point of the issue was to have the PowerShell extension apply it automatically.

@jborean93, while breaking shebang functionality with a BOM is a good point in general:

  • you wouldn't normally create shebang-based shell scripts with a .ps1 extension.

  • even if you did, we're not talking about breaking existing scripts, but about a sensible default for new ones.

@ghost ghost added Needs: Maintainer Attention Maintainer attention needed! and removed Needs: Author Feedback Please give us the requested feedback! labels Mar 22, 2022
@JustinGrote
Collaborator

Thanks, @andschwa - this manual configuration option is already a part of the OP; the point of the issue was to have the PowerShell extension apply it automatically.

@jborean93, while breaking shebang functionality with a BOM is a good point in general:

  • you wouldn't normally create shebang-based shell scripts with a .ps1 extension.
  • even if you did, we're not talking about breaking existing scripts, but about a sensible default for new ones.

It's a significant item and one that would be very difficult to troubleshoot, so I think this should be a toggleable opt-in option at best, not an automatic default, especially since Windows PowerShell (5.1) is on deprecated life support.

@jborean93

you wouldn't normally create shebang-based shell scripts with a .ps1 extension.

Why not? It's a perfectly valid thing to do on Linux to be able to do ./my_script.ps1 to execute your script without specifying pwsh -File ./my_script.ps1.

even if you did, we're not talking about breaking existing scripts, but about a sensible default for new ones.

It breaks the workflow where people create a script in VSCode and want to do chmod +x my_script.ps1; ./my_script.ps1. They would then have to go back into VSCode and resave the file as UTF-8 without BOM to get this working.
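For illustration, the resave step could also be scripted. This is only a sketch of one possible workaround, not something the extension provides: `strip_utf8_bom` is a hypothetical helper that removes a leading UTF-8 BOM from a file in place, and the shebang line in the demo is an assumed example.

```python
import codecs
import tempfile
from pathlib import Path


def strip_utf8_bom(path: Path) -> bool:
    """Remove a leading UTF-8 BOM in place; return True if one was removed."""
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True


# Demo on a throwaway file standing in for my_script.ps1.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "my_script.ps1"
    script.write_bytes(codecs.BOM_UTF8 + b"#!/usr/bin/env pwsh\n")
    print(strip_utf8_bom(script))    # True
    print(script.read_bytes()[:2])   # b'#!'
```

The helper rewrites only files that actually start with the BOM, so running it twice on the same file is harmless.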

Both sides have disadvantages, I'm sure you can argue both ways but the question was asked what could it break and this is one of them. Personally I think trying to cater to an effectively EOL product at the expense of the new way forward is digging yourself into a hole you eventually need to get out of in the future.

@andyleejordan
Member

I think this should be a toggleable opt-in option at best, not an automatic default

That's essentially where it already is. You can just change the encoding (or the default) yourself in settings.

@andyleejordan
Member

Both sides have disadvantages, I'm sure you can argue both ways but the question was asked what could it break and this is one of them. Personally I think trying to cater to an effectively EOL product at the expense of the new way forward is digging yourself into a hole you eventually need to get out of in the future.

I agree with this, and yes breaking shebang would be big IMHO. Thank you for pointing that out...we had this vague feeling there was something big on Linux that it broke but couldn't remember what!

@andyleejordan andyleejordan added Resolution-Answered Will close automatically. and removed Needs: Maintainer Attention Maintainer attention needed! labels Mar 22, 2022
@irvnriir

irvnriir commented Mar 22, 2022

Text file creation affects more than a compiler. I have also seen multiple free Windows programs that do not support a BOM.

Having an extension alter VSC behavior that the user has set (or knows) is basically more unexpected than unusual characters being misinterpreted by a specific compiler.

@mklement0
Contributor Author

on deprecated life support.
effectively EOL product

Fair enough, but we know how slow and painful such demises are in the Windows world...
In other words: cross-edition scripts are likely to be around for a looong time.

it's a perfectly valid thing [...] to be able to do ./my_script.ps1

It's technically valid, but conceptually ill-advised (as creating shebang-based .sh files is) - the point of creating a shebang-based script is to create an executable (whose underlying engine is an implementation detail) - and executables don't have extensions on Unix-like platforms.

A shebang-based file with a .ps1 extension is particularly problematic in that it will act differently inside PowerShell, as it still runs in-process there, potentially changing its behavior (it sees the current session's preferences, definitions, ...).

However, I see your point re starting out with a .ps1 file as part of development workflow, and how that could be confusing.

yes breaking shebang would be big IMHO.

In summary: it would break something that shouldn't be done to begin with.


All that said, overall I do agree that BOM-less UTF-8 is the way forward.

@ghost

ghost commented Jul 13, 2022

Note that legacy PowerShell has issues with signed scripts that use UTF-8 without BOM and contain Unicode characters. One solution is to switch to another encoding recommended in the Windows world, UTF-16 LE (with BOM), in order to avoid the ill-fated UTF-8 with BOM. This has already been fixed in PowerShell 7, but not in the native powershell.exe.
Powershell/Powershell#3466

Would default encoding to UTF-16 LE break Unix shebang?

@ghost

ghost commented Jul 13, 2022

Thank you for your comment, but please note that this issue has been closed for over a week. For better visibility, consider opening a new issue with a link to this instead.

@jborean93

Would default encoding to UTF-16 LE break Unix shebang?

It will. The shebang is checked in the kernel by reading the first few bytes as an ASCII-equivalent string, and any BOM on a file will break that.
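A quick sketch of what a UTF-16 LE file would look like at the byte level, using an assumed `#!/usr/bin/env pwsh` first line. It only compares leading bytes against `#!` and does not reproduce the kernel's logic.

```python
import codecs

line = "#!/usr/bin/env pwsh\n"

# UTF-16 LE with BOM, as an editor would save it.
utf16_with_bom = codecs.BOM_UTF16_LE + line.encode("utf-16-le")
# UTF-16 LE without the BOM, for comparison.
utf16_no_bom = line.encode("utf-16-le")

print(utf16_with_bom[:6].hex(" "))    # ff fe 23 00 21 00
print(utf16_with_bom[:2] == b"#!")    # False
print(utf16_no_bom[:2] == b"#!")      # False (bytes are 0x23 0x00)
```

Besides the BOM, UTF-16 LE stores each ASCII character as two bytes, so the leading bytes would not be 0x23 0x21 even if the BOM were omitted.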
