Umlaute brake when using curl.exe #21456

Closed
5 tasks done
kort3x opened this issue Apr 11, 2024 · 6 comments

kort3x commented Apr 11, 2024

Prerequisites

Steps to reproduce

I did this
curl https://slftool.github.io/data.json

then

(curl https://slftool.github.io/data.json)

Expected behavior

For both commands, Umlaute should work as in this line from the output of the first command:
"stadt": ["Zweibrücken (Deutschland)", "Zwiesel (Deutschland)", "Zwickau (Deutschland)", "Zürich (Schweiz)", "Zabol (Iran)", "Zagreb (Kroatien)"]

Actual behavior

Umlaute are broken in the output of the second command, the one with parentheses:
"stadt": ["Zweibr├╝cken (Deutschland)", "Zwiesel (Deutschland)", "Zwickau (Deutschland)", "Z├╝rich (Schweiz)", "Zabol (Iran)", "Zagreb (Kroatien)"],

As soon as you touch the output of curl, they break.

Environment data

curl -V
curl 8.7.1 (x86_64-w64-mingw32) libcurl/8.7.1


$PSVersionTable
Name                           Value
----                           -----
PSVersion                      7.4.1
PSEdition                      Core
GitCommitId                    7.4.1
OS                             Microsoft Windows 10.0.22621
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

Visuals

No response

kort3x added the Needs-Triage label Apr 11, 2024
Collaborator

jborean93 commented Apr 11, 2024

When you run an executable in PowerShell, there are two ways the executable can output text:

  • To the console directly
  • To the process' stdout pipe

The first example is what happens when you execute something in PowerShell without capturing or redirecting the output in any way, for example

PS C:\> curl.exe https://slftool.github.io/data.json

The second example is what happens when the process' output is redirected by PowerShell because PowerShell is going to capture the data. This occurs when assigning to a variable, piping data, redirecting, or, in your case, grouping, for example

PS C:\> $out = curl.exe https://slftool.github.io/data.json
PS C:\> curl.exe https://slftool.github.io/data.json | Out-String
PS C:\> (curl.exe https://slftool.github.io/data.json)

# This example no longer applies in pwsh 7.4+, but it does in older versions
PS C:\> curl.exe https://slftool.github.io/data.json > C:\test.txt

This is important because in the second scenario PowerShell converts the raw bytes from the process' stdout pipe using the encoding in [Console]::OutputEncoding, which on Windows does not default to UTF-8. This explains why running curl by itself works fine, because curl is writing to the console directly, but (curl.exe ...) does not.

What you need to do is ensure that [Console]::OutputEncoding is set to the encoding curl uses for its output. Most likely you'll need to set it to UTF-8, as most modern applications default to that to support Unicode characters

[Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
curl.exe ... | Out-String

If the encoding PowerShell uses does not match the one curl uses for its output, you'll get incorrect characters back. Take ü from your example: it is the Unicode character U+00FC, which UTF-8 encodes as the two bytes [byte[]]@(0xC3, 0xBC), so curl writes those two bytes for that character. On Windows the default console encoding is typically a legacy OEM code page such as CP437 or CP850, though it may differ depending on how the OS is set up. These single-byte encodings treat the 2 bytes as 2 separate characters (├ and ╝ in your output), so the wrong value is returned when the bytes are decoded.
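
To make the byte-level explanation concrete, here is a minimal Python sketch (Python is used purely for illustration and is not part of the original thread). It encodes a string containing ü as UTF-8 and decodes the resulting bytes with several single-byte code pages. The choice of CP437, CP850 and windows-1252 as comparison code pages is an assumption about what a given Windows system might be configured to use.

```python
# Illustration only: how UTF-8 bytes look when decoded with the wrong code page.
text = "Zürich"

# ü (U+00FC) becomes the two bytes 0xC3 0xBC in UTF-8.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes.hex(" "))

# Single-byte legacy code pages turn each of those bytes into its own character.
for codepage in ("cp437", "cp850", "cp1252"):
    print(codepage, "->", utf8_bytes.decode(codepage))

# Decoding with the encoding the program actually used restores the text.
print("utf-8 ->", utf8_bytes.decode("utf-8"))
```

The two OEM code pages reproduce the `├╝` seen in the report, while windows-1252 would have produced `Ã¼` instead.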

It's also good practice to set [Console]::InputEncoding and $OutputEncoding to the same value, e.g.

$OutputEncoding = [Console]::OutputEncoding = [Console]::InputEncoding = [System.Text.UTF8Encoding]::new()

Contributor

mklement0 commented Apr 11, 2024

@jborean93's helpful comment was posted while I was in the middle of composing mine, but perhaps the following provides a slightly different angle / supplemental information that may be helpful:

In short:

  • [Console]::OutputEncoding must match the actual character encoding emitted by an external (native) program in order for PowerShell to decode it correctly.

    • Without PowerShell involvement, namely when directly printing to the console (terminal), this aspect doesn't come into play, so that things still print correctly on Windows.
  • Unfortunately, on Windows [Console]::OutputEncoding still defaults to the legacy OEM code page associated with the legacy system locale, such as CP437 on US-English systems; see issue #7233, "Make console windows fully UTF-8 by default on Windows, in line with the behavior on Unix-like platforms - character encoding, code page".

  • Using (...) necessitates decoding by PowerShell, as does assigning to a variable, sending output to another command through the pipeline (|), and, in v7.3 and earlier only, using > / >> to redirect to a file.

    • In v7.4+, direct use of > / >> now relays an external program's raw byte output to the target file, and so does using | when used between external programs.
  • Thus, to capture an external program's UTF-8 output for programmatic processing, you must (temporarily) set [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
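
The rule in the list above (whoever captures the output must decode it with the encoding the program emitted) can be sketched outside PowerShell as well. The following Python example is only an illustration under stated assumptions: the child process is a hypothetical stand-in for curl.exe that writes raw UTF-8 bytes to stdout, and `decode_output` is a helper name invented for this sketch.

```python
import subprocess
import sys

# Hypothetical stand-in for curl.exe: a child process that writes raw
# UTF-8 bytes to its stdout pipe (the \u escape keeps the command line ASCII).
child_code = r"import sys; sys.stdout.buffer.write('Z\u00fcrich (Schweiz)'.encode('utf-8'))"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    check=True,
)
raw = result.stdout  # undecoded bytes, exactly as they arrived on the pipe


def decode_output(data: bytes, encoding: str) -> str:
    """Decode captured bytes the way a capturing shell would."""
    return data.decode(encoding)


# Mismatched decoder (a legacy OEM code page): the umlaut is garbled.
print(decode_output(raw, "cp437"))

# Matching decoder: the umlaut survives.
print(decode_output(raw, "utf-8"))
```

The bytes on the pipe are the same in both cases; only the decoder chosen by the capturing side differs, which is the role [Console]::OutputEncoding plays in PowerShell.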


Note that there is a way to avoid the need for the above:

  • Assuming you're an administrator, you can configure your system to use UTF-8 system-wide (run intl.cpl, activate tab Administrative, press button Change system locale... and activate checkbox Beta: Use Unicode UTF-8 for worldwide language support, then reboot).
  • This sets both the OEM and the ANSI code pages to 65001 (BOM-less UTF-8), but note that doing so has far-reaching consequences that must be carefully considered: see PowerShell use Chinese character “” just like "" ? #21437 (comment)


rhubarb-geek-nz commented Apr 12, 2024

The standard encoding for JSON files is Unicode, typically UTF-8.

https://www.json.org/json-en.html

A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes. A character is represented as a single character string. A string is very much like a C or Java string.

When curl is writing data to stdout it should be writing verbatim binary content and not doing any character translation.
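
As a small illustration of why the verbatim bytes matter, this Python sketch parses a JSON payload directly from its raw UTF-8 bytes and compares that with decoding the bytes using a legacy code page first. The payload is a shortened sample modeled on the line quoted in the report, not data fetched from the URL, and CP437 is an assumed example of a mismatched code page.

```python
import json

# Sample payload modeled on the line quoted in the report; not fetched from the network.
raw = '{"stadt": ["Zweibrücken (Deutschland)", "Zürich (Schweiz)"]}'.encode("utf-8")

# json.loads accepts bytes and detects UTF-8, UTF-16 or UTF-32 on its own.
data = json.loads(raw)
print(data["stadt"])

# Decoding with a legacy code page first corrupts the strings before parsing.
broken = json.loads(raw.decode("cp437"))
print(broken["stadt"])
```

Both calls parse successfully, so a mismatched decoding step does not fail loudly; it silently yields wrong strings.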

Author

kort3x commented Apr 12, 2024

Thank you all.

kort3x closed this as completed Apr 12, 2024
microsoft-github-policy-service bot removed the Needs-Triage label Apr 12, 2024
Author

kort3x commented Apr 12, 2024

[Console]::OutputEncoding = [Console]::InputEncoding = [System.Text.UTF8Encoding]::new()

helps with UTF-8 and external binaries
