-
-
Notifications
You must be signed in to change notification settings - Fork 14.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
glibc: add patch to use utf-8 when LANG is unset #61202
Conversation
For reference, this is what musl already does: https://wiki.musl-libc.org/functional-differences-from-glibc.html#Default-locale |
@matthewbauer I would love to see this. However we should also take macOS into account. If we start to use UTF-8 by default on linux we also should on macOS or otherwise we break accidentally packages for macOS. I don't have access to macOS however I roughly described in #54485 (comment) what needs to be done in order to get C.UTF-8 support. |
Have you ever seen this problem appear on macOS? It's possible it happens but I can't recall it happening. Usually these hacks are glibc specific. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will glibc
provide that UTF-8 locale without referencing the glibcLocale
files? My understanding is that those files are necessary at run-time, but glibc
does not know their location.
Some possible things that may work:
|
I don't think it's necessary for C.UTF-8. It's definitely necessary for en_US.UTF-8, but that is due to the US localizations. UTF-8 and C should be baked in to glibc. |
@matthewbauer you don't want to set a invalid locale on macOS or things start to fail. |
@@ -37,6 +37,7 @@ in rec { | |||
export CMAKE_OSX_ARCHITECTURES=x86_64 | |||
# Workaround for https://openradar.appspot.com/22671534 on 10.11. | |||
export gl_cv_func_getcwd_abort_bug=no | |||
export LC_CTYPE=UTF-8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be still great if we could set LANG on macOS as well. Reason is, that $LC_CTYPE
has a different semantics/priority compared to $LANG
. LANG has a lower priority then all LC_*
variables, so if an existing package already sets LC_ALL
or similar to some value, it would be preferred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem to work correctly:
$ env -i LANG=UTF-8 locale
LANG="UTF-8"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
vs.
$ env -i LANG=en_US.UTF-8 locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
vs.
$ env -i LANG=en_US.UTF-8 locale
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
This is the same thing described in the FreeBSD proposal:
Introduce C.UTF-8 locale, using the same common CTYPE map as other UTF-8 locales ... and having all other components use C locale
UTF-8 is only valid for LC_CTYPE, which makes sense. This is a BSDism it looks like though, so documentation on it would be helpful. My thinking on this is that if a package really needs a non-UTF-8 encoding in 2019, they should specifically request it through setting LC_CTYPE or LC_ALL. Also just to show that LC_ALL does take precedence:
$ env -i LC_ALL=en_US.UTF-8 LC_CTYPE=UTF-8 locale
LANG=
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I meant by having a C.UTF-8 locale generated on macOS instead of using LC_CTYPE=UTF-8
.
The code of glibc does first look into the archive in its own installation prefix before falling back to |
I should verify that c65124a3ce0b59dcf71e4ad9c72f6616f67b7079 works correctly for all of the changed files. It's possible that somewhere the locale vars are set and they specifically want one of the glibcLocales included. This might make sense for certain kinds of tests, but also seems like kind of bad practice. |
I don't know, it seems scary to patch Glibc in this way. Does upstream have an opinion about this? |
@edolstra Upstream ticket and discussions:
It looks like it's part of the roadmap, just never been done? I can't find any distros that have done the switch for the default, though many provide C.UTF-8 in their glibc package (as we already do). It seems reasonable to me to enable UTF-8 by default, though. |
Why do these even use the huge full |
Yeah unfortunately C.UTF-8 only works on Glibc/Musl. So we need to to either use something that works on macOS like en_US.UTF-8 or else apply a patch like this. |
This allows the default builder to use utf-8 encoding. Previously, when LANG and LC_ALL was left unset, glibc would default to “C”. This is not good in many. As a result of this, I think we can get rid of a lot of uses of LC_ALL in Nixpkgs, where utf-8 was needed. These can be found by grepping for: (?:LC_ALL|LANG) *= *["']?(?:en_US|C)\.UTF-8["']?
A couple of places set LANG=“en_US.UTF-8” to get unicode working in the Nix builder. This is very very bad! Not everyone who uses Nix lives in the U.S. and is an English speaker. Instead we should use the “C.UTF-8”, which is an UTF-8 capable version of the unlocalized C locale. Please use this one! The previous commit makes Glibc fall back to this locale when LANG is unset. This is in line with the Musl behavior. As a result, the Nix builder is made UTF-8 capable without any LOCALE_ARCHIVE or LANG hacks.
Unfortunately, the default macOS locale does not have UTF-8 support. You need to set this for correct behavior.
977d329
to
0ebb6d2
Compare
These break due to default locale not being what gnulib expects.
2b3829b
to
9b218d1
Compare
@@ -92,6 +92,11 @@ stdenv.mkDerivation ({ | |||
url = "https://salsa.debian.org/glibc-team/glibc/raw/49767c9f7de4828220b691b29de0baf60d8a54ec/debian/patches/localedata/locale-C.diff"; | |||
sha256 = "0irj60hs2i91ilwg5w7sqrxb695c93xg0ik7yhhq9irprd7fidn4"; | |||
}) | |||
(fetchurl { | |||
url = "https://bugs.debian.org/cgi-bin/bugreport.cgi?att=1;bug=874160;filename=0001-Default-to-C.UTF-8-on-setlocale-.-if-no-env-vars-are.patch;msg=5"; | |||
name = "0001-Default-to-C.UTF-8-on-setlocale-.-if-no-env-vars-are.patch"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So debian switched to C.UTF-8
now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this is just a proposed patch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matthewbauer This needs a rebase :) |
I got a osx vm now and could actually test the patch I proposed here: #54485 (comment) |
Alacritty also switched to |
Instead of patching |
Closing for now - glibc probably shouldn't be patched like this & probably should just set LANG=C.UTF-8 in stdenv. |
This allows the default builder to use utf-8 encoding. Previously,
when LANG and LC_ALL was left unset, glibc would default to “C”. This
is not good in many.
As a result of this, I think we can get rid of a lot of uses of LC_ALL
in Nixpkgs, where utf-8 was needed. These can be found by grepping
for:
(?:LC_ALL|LANG) *= *["']?(?:en_US|C)\.UTF-8["']?
A lot of times the ones using en_US.UTF-8 also add glibcLocales.
/cc @vaibhavsagar