"invalid multibyte sequence" error from msgfmt on "¡" #299

Closed
maelle opened this issue Oct 6, 2023 · 11 comments · Fixed by #300

Comments

@maelle
Contributor

maelle commented Oct 6, 2023

👋, thanks for maintaining potools!

I'm writing an example package, and noticed I can't use "¡" in either msgid or msgstr. Is that expected?

@MichaelChirico
Owner

That sounds wrong to me! Can you share more info (the platform you're using, the stack trace)?

@maelle
Contributor Author

maelle commented Oct 6, 2023

If in https://github.com/maelle/pockage/blob/a36978a1c06dcdc3dbd6200f4110c2bbaa1ba21b/po/R-es.po#L20 I add "¡" I get

> potools::po_compile()
Recompiling 'ca' R translation
Running system command msgfmt -c --statistics -o './inst/po/ca/LC_MESSAGES/R-pockage.mo' './po/R-ca.po'...
./po/R-ca.po:15:19: invalid multibyte sequence
./po/R-ca.po:15:20: invalid multibyte sequence
msgfmt: found 2 fatal errors
Warning: running msgfmt on R-ca.po failed.
Here is the po file:
msgid ""
msgstr ""
"Project-Id-Version: pockage 0.0.0.9000\n"
"POT-Creation-Date: 2023-10-06 10:45+0200\n"
"PO-Revision-Date: 2023-10-06 10:33+0200\n"
"Last-Translator: Automatically generated\n"
"Language-Team: none\n"
"Language: ca\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=ASCII\n"
"Content-Transfer-Encoding: 8bit\n"

#: mensaje.R:9
msgid "user"
msgstr "usuari/usuària"

#: mensaje.R:10
msgid "Hello {name}!"
msgstr "Hola {name}!"
Recompiling 'es' R translation
Running system command msgfmt -c --statistics -o './inst/po/es/LC_MESSAGES/R-pockage.mo' './po/R-es.po'...
./po/R-es.po:20:9: invalid multibyte sequence
./po/R-es.po:20:10: invalid multibyte sequence
msgfmt: found 2 fatal errors
Warning: running msgfmt on R-es.po failed.
Here is the po file:
msgid ""
msgstr ""
"Project-Id-Version: pockage 0.0.0.9000\n"
"POT-Creation-Date: 2023-10-06 10:45+0200\n"
"PO-Revision-Date: 2023-10-06 10:33+0200\n"
"Last-Translator: Automatically generated\n"
"Language-Team: none\n"
"Language: es\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=ASCII\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"

#: mensaje.R:9
msgid "user"
msgstr "usuari@"

#: mensaje.R:10
msgid "Hello {name}!"
msgstr "¡Hola {name}!"
Recompiling 'fr' R translation
Running system command msgfmt -c --statistics -o './inst/po/fr/LC_MESSAGES/R-pockage.mo' './po/R-fr.po'...
./po/R-fr.po:16:20: invalid multibyte sequence
./po/R-fr.po:16:21: invalid multibyte sequence
msgfmt: found 2 fatal errors
Warning: running msgfmt on R-fr.po failed.
Here is the po file:
msgid ""
msgstr ""
"Project-Id-Version: pockage 0.0.0.9000\n"
"POT-Creation-Date: 2023-10-06 10:45+0200\n"
"PO-Revision-Date: 2023-10-06 10:33+0200\n"
"Last-Translator: Malle Salmon\n"
"Language-Team: none\n"
"Language: fr\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=ASCII\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n > 1);\n"

#: mensaje.R:9
msgid "user"
msgstr "utilisateur·rice"

#: mensaje.R:10
msgid "Hello {name}!"
msgstr "Salut {name} !"

This is on:

─ Session info ─────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.0 (2022-04-22)
 os       Ubuntu 20.04.6 LTS
 system   x86_64, linux-gnu
 ui       RStudio
 language en_US.utf8
 collate  en_US.utf8
 ctype    en_US.utf8
 tz       Europe/Paris
 date     2023-10-06
 rstudio  2023.06.2+561 Mountain Hydrangea (desktop)
 pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

I installed potools from GitHub with pak, and didn't have to worry about the system dependency (or maybe I should have!).

@maelle
Contributor Author

maelle commented Oct 6, 2023

Apparently I also get the error for the slash in the other file https://github.com/maelle/pockage/blob/a36978a1c06dcdc3dbd6200f4110c2bbaa1ba21b/po/R-ca.po#L15 but that wasn't breaking on its own.
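
A note for readers checking their own files: the columns in the log above (15:19 and 15:20 for R-ca.po, 20:9 and 20:10 for R-es.po, 16:20 and 16:21 for R-fr.po) appear to line up with the two UTF-8 bytes of "à", "¡" and "·" respectively, rather than with the slash. The following is a minimal sketch for spotting that kind of mismatch. The function names are hypothetical (they are not part of potools or gettext), and it assumes the header line starts with `"Content-Type:` as in the files shown.

```shell
# Hypothetical helpers; not part of potools or gettext.

# Print the charset declared in the Content-Type header line.
declared_charset() {
  sed -n '/^"Content-Type:/s/.*charset=\([A-Za-z0-9_-]*\).*/\1/p' "$1"
}

# Count bytes outside the 7-bit ASCII range (0x80 and above).
non_ascii_bytes() {
  LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# Report a file whose header says ASCII but whose content is not.
check_po() {
  cs=$(declared_charset "$1")
  n=$(non_ascii_bytes "$1")
  if [ "$cs" = "ASCII" ] && [ "$n" -gt 0 ]; then
    echo "$1: declares charset=ASCII but has $n non-ASCII bytes"
    return 1
  fi
  return 0
}
```

Running `check_po po/R-es.po` before `potools::po_compile()` would flag the file without needing msgfmt installed.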

@MichaelChirico
Owner

My main concern about the platform is whether this is happening on Windows. I'm definitely surprised this is happening on Ubuntu and hadn't been caught yet! I'll take a look at this soon.

@hadley
Collaborator

hadley commented Oct 6, 2023

I know literally nothing about this, but this line caught my eye:

"Content-Type: text/plain; charset=ASCII\n"

It would be worth trying to change ASCII to UTF-8.
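
For anyone hitting the same error, the header change suggested here can be applied with a small shell helper. This is a minimal sketch: `set_charset_utf8` is a hypothetical name, not something potools or gettext provides, and it assumes the header line starts with `"Content-Type:` as in the files above.

```shell
# Hypothetical helper; not part of potools or gettext.
# Rewrites the charset declared in a .po header to UTF-8, in place.
set_charset_utf8() {
  file=$1
  tmp="${file}.tmp.$$"
  # Only touch the Content-Type header line; leave every other line alone.
  # Writing to a temporary file avoids the non-portable `sed -i`.
  sed '/^"Content-Type:/s/charset=[A-Za-z0-9_-]*/charset=UTF-8/' "$file" > "$tmp" &&
    mv "$tmp" "$file"
}
```

Usage would be `set_charset_utf8 po/R-es.po`, followed by re-running `potools::po_compile()`. Note that this only changes the declaration; it assumes the bytes in the file are already UTF-8 encoded.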

@maelle
Contributor Author

maelle commented Oct 9, 2023

@hadley yes, this worked! 🎉

@MichaelChirico
Owner

Thanks @hadley!

Maëlle, can you tell me how that .po file was generated in the first place? I want to make sure {potools} is not emitting any troublesome headers like that.

@MichaelChirico
Owner

Looks like {potools} can do so. Here's how run_msginit() would work:

msginit -i R-pockage.pot -o R-ja.po -l ja -w 120 --no-translator
grep charset R-ja.po
# "Content-Type: text/plain; charset=ASCII\n"

@MichaelChirico
Owner

MichaelChirico commented Oct 22, 2023

I don't see an option for msginit to force it to use charset=UTF-8; it looks like the charset is entirely derived from the header metadata in the .pot file:

‘MIME-Version, Content-Type, Content-Transfer-Encoding’

These values are set according to the content of the POT file and the current locale. If the POT file contains charset=UTF-8, it means that the POT file contains non-ASCII characters, and we keep the UTF-8 encoding. Otherwise, when the POT file is plain ASCII, we use the locale’s encoding.

I had hoped using msginit -l ja.UTF-8 ... would do the trick but no such luck.

If I replace charset=CHARSET with charset=UTF-8 in the .pot file, msginit indeed carries that over to the output .po file.

Looking now how safe it may be to default to charset=UTF-8 in .pot files...
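
The workaround described in this comment (patching the template before msginit sees it) could be sketched as follows. Both function names are hypothetical and this is not how {potools} does it. The msginit flags are the ones quoted earlier in the thread, and `init_po_utf8` assumes GNU gettext's msginit is on the PATH.

```shell
# Hypothetical helpers; not part of potools or gettext.

# Print a copy of a .pot template with the placeholder charset replaced,
# leaving the original file untouched.
pot_as_utf8() {
  sed '/^"Content-Type:/s/charset=CHARSET/charset=UTF-8/' "$1"
}

# Patch the template into a temporary file, then run msginit on that copy
# with the same flags quoted in the thread.
init_po_utf8() {
  pot=$1
  lang=$2
  out=$3
  tmp=$(mktemp) || return 1
  pot_as_utf8 "$pot" > "$tmp" &&
    msginit -i "$tmp" -o "$out" -l "$lang" -w 120 --no-translator
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

For example, `init_po_utf8 R-pockage.pot ja R-ja.po` followed by `grep charset R-ja.po` would show which charset msginit wrote to the new file.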

@MichaelChirico
Owner

Another note: there seems to be a conflict between po_create(), which wraps msginit, and write_po_file(), which always sets charset=UTF-8:

#' The `charset` for output is always set to `"UTF-8"`; this is
#' intentional to make it more cumbersome to create non-UTF-8 files.

@maelle
Contributor Author

maelle commented Oct 23, 2023

I had created the files using potools. Thank you!
