USB identifiers with funny characters create mojibake #17776

Open
chrysn opened this issue Mar 8, 2022 · 0 comments
Labels
Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)

Description

USB is a protocol ready for the 21st century, so one might be tempted to take advantage of that and use friendly labels:

#define CONFIG_USB_PRODUCT_STR  "Schöner USB-Stick"

Sadly, the result reminds one of a dark era of the electronics industry: lsusb reports "SchÃ¶ner USB-Stick".

A classic case of "written in Unicode, interpreted in latin1".

Interestingly, there is never any latin1 involved. The string enters C as a UTF-8 string (as intended), but when it passes through _cpy_str_to_utf16 in usbus_control.c, each byte is emitted as the low half of a UTF-16 word, so every byte of a multi-byte sequence turns into a character of its own. (Yes, UTF-16 is what USB uses internally ... not so 21st century any more, but at least it's real UTF-16 and not UCS-2.)
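To make the effect concrete, here is a minimal sketch of a byte-wise copy of that kind. It is not the RIOT code: `widen_bytes_to_utf16` is a hypothetical stand-in that writes into a plain buffer instead of going through the control slicer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a byte-wise copy: every byte of the UTF-8 input
 * becomes the low half of one UTF-16 code unit, the high half stays zero. */
size_t widen_bytes_to_utf16(const char *str, uint16_t *out, size_t out_len)
{
    size_t n = 0;

    while (*str && n < out_len) {
        /* Cast through uint8_t so bytes >= 0x80 are not sign-extended. */
        out[n++] = (uint8_t)*str++;
    }
    return n;
}
```

Fed the two UTF-8 bytes of "ö" (0xC3 0xB6), this yields the code units 0x00C3 and 0x00B6, which a host displays as "Ã" followed by "¶". The result matches what a latin1 misreading would give, even though no latin1 decoder is involved.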

What I'd expect to happen

Yeah, if it were that easy I'd send a PR rather than a rant.

Options are:

  • Do nothing, and point out in the documentation that only ASCII is allowed.
  • Do nothing, and point out in the documentation that only latin1 is allowed (which, by construction of Unicode code points 128-255, also works -- but that's a character encoding I'd rather not have in 21st century documentation)
  • Don't copy over non-ASCII bytes (failing safe -- never mojibake, but it may go unnoticed)
  • Add Unicode support. It's not terribly much code (see the decode_code_point of https://gist.github.com/tylerneylon/9773800; supporting more than the BMP, e.g. 💾 inside a model name, needs another 3-or-so bit-shift-plus-addition lines and 4 rather than 2 calls to usb_control_slicer_put_char).
  • Add Unicode support but behind a pseudomodule.

@bergzand: preferences?
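For comparison, the fail-safe option (dropping non-ASCII bytes) could look roughly like the sketch below. The function name `copy_ascii_as_utf16le` and the plain output buffer are assumptions made for illustration; a real change would presumably emit through the slicer as the existing function does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: copy only ASCII bytes of str into out as UTF-16LE
 * (low byte first, high byte zero) and silently skip everything else.
 * Returns the number of bytes written. */
size_t copy_ascii_as_utf16le(const char *str, uint8_t *out, size_t out_len)
{
    size_t len = 0;

    for (; *str; str++) {
        unsigned char c = (unsigned char)*str;

        if (c >= 0x80) {
            continue;           /* part of a multi-byte UTF-8 sequence: drop it */
        }
        if (len + 2 > out_len) {
            break;              /* no room for another code unit */
        }
        out[len++] = c;         /* low byte */
        out[len++] = 0;         /* high byte */
    }
    return len;
}
```

With this, "Schöner" would come out as "Schner": never mojibake, but the missing letter is easy to overlook, which is the drawback named in the list.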

Quick copy-paste code
// copied over from https://gist.github.com/tylerneylon/9773800
// Stops at any null characters.
int decode_code_point(const char **s) {
  int k = **s ? __builtin_clz(~(**s << 24)) : 0;  // Count # of leading 1 bits.
  int mask = (1 << (8 - k)) - 1;                  // All 1's with k leading 0's.
  int value = **s & mask;
  for (++(*s), --k; k > 0 && **s; --k, ++(*s)) {  // Note that k = #total bytes, or 0.
    value <<= 6;
    value += (**s & 0x3F);
  }
  return value;
}

static size_t _cpy_str_to_utf16(usbus_t *usbus, const char *str)
{
    size_t len = 0;
    uint32_t unichar;

    while (*str) {
        unichar = decode_code_point(&str);

        if (unichar < 0x10000) {
            usbus_control_slicer_put_char(usbus, unichar & 0xff);
            usbus_control_slicer_put_char(usbus, unichar >> 8);
            len += 2;
        } else {
            // following https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF

            uint32_t u = (unichar - 0x10000) & 0xfffff;
            uint16_t w1 = 0xd800 + (u >> 10);
            uint16_t w2 = 0xdc00 + (u & 0x03ff);

            usbus_control_slicer_put_char(usbus, w1 & 0xff);
            usbus_control_slicer_put_char(usbus, w1 >> 8);
            usbus_control_slicer_put_char(usbus, w2 & 0xff);
            usbus_control_slicer_put_char(usbus, w2 >> 8);
            len += 4;
        }
    }
    return len;
}
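As a quick sanity check of the surrogate arithmetic in the else branch, the same computation can be pulled out into a standalone helper. `utf16_surrogates` is a hypothetical name used only for this sketch and is not part of RIOT.

```c
#include <stdint.h>

/* Hypothetical helper: split a code point in U+10000..U+10FFFF into its
 * UTF-16 high (w1) and low (w2) surrogates, using the same arithmetic as
 * the else branch above. */
void utf16_surrogates(uint32_t unichar, uint16_t *w1, uint16_t *w2)
{
    uint32_t u = (unichar - 0x10000) & 0xfffff;  /* 20-bit offset */

    *w1 = (uint16_t)(0xd800 + (u >> 10));        /* upper 10 bits */
    *w2 = (uint16_t)(0xdc00 + (u & 0x03ff));     /* lower 10 bits */
}
```

For 💾 (U+1F4BE) the offset is 0xF4BE, which gives w1 = 0xD83D and w2 = 0xDCBE. Each of the two is then sent low byte first, which accounts for the 4 calls to usbus_control_slicer_put_char.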
@maribu maribu added the Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) label May 22, 2023