USB identifiers with funny characters create mojibake #17776

chrysn · 2022-03-08T20:51:27Z

Description

USB is a protocol ready for the 21st century, so one might be tempted to take advantage of that and use friendly labels:

#define CONFIG_USB_PRODUCT_STR  "Schöner USB-Stick"

Sadly, the result is reminiscent of a dark era of the electronics industry: lsusb reports SchÃ¶ner USB-Stick.

A classic case of "written in Unicode, interpreted in latin1".

Interestingly, there is never any latin1 involved -- what happens is that the string enters C as a UTF-8 string (as intended), but when it passes through _cpy_str_to_utf16 in usbus_control.c, each byte is emitted as the low half of a UTF-16 code unit, with the high half left at zero (yes, UTF-16 is what USB uses internally ... not so 21st century any more, but at least it's real UTF-16 and not UCS-2).
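To make the failure concrete, here is a small standalone sketch of the per-byte widening described above. It is not the RIOT code: `widen_bytes_to_utf16` is a hypothetical helper that writes into a plain array instead of going through the control slicer, and it only mimics the behaviour as I understand it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the current behaviour: every input byte
 * becomes the low half of one UTF-16 code unit; the high half is zero. */
static size_t widen_bytes_to_utf16(const char *str, uint16_t *out, size_t out_len)
{
    size_t n = 0;

    while (*str && n < out_len) {
        out[n++] = (uint8_t)*str++;
    }
    return n;
}
```

Feeding it "ö" (the UTF-8 bytes 0xC3 0xB6) yields the two code units 0x00C3 and 0x00B6, i.e. U+00C3 (Ã) and U+00B6 (¶), which matches what lsusb shows.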

What I'd expect to happen

Yeah, if it were that easy I'd send a PR rather than a rant.

Options are:

Do nothing, and point out in the documentation that only ASCII is allowed.
Do nothing, and point out in the documentation that only latin1 is allowed (which, by construction of Unicode code points 128-255, also works -- but that's a character encoding I'd rather not have in 21st century documentation).
Don't copy over non-ASCII bytes (failing safe -- never mojibake, but it may go unnoticed)
Add Unicode support. It's not terribly much code (see decode_code_point in https://gist.github.com/tylerneylon/9773800; supporting more than the BMP, e.g. 💾 inside a model name, needs another 3-or-so bit-shift-plus-addition lines and 4 rather than 2 calls to usbus_control_slicer_put_char).
Add Unicode support but behind a pseudomodule.
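For comparison, the fail-safe option (dropping non-ASCII bytes) could look roughly like the sketch below. This is an illustration only: `copy_ascii_only_utf16le` is a hypothetical name, and it writes into a caller-provided buffer rather than calling usbus_control_slicer_put_char, so that it can be tried on a host.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the "fail safe" option: emit UTF-16LE for ASCII
 * bytes and silently drop everything else. Returns the bytes written. */
static size_t copy_ascii_only_utf16le(const char *str, uint8_t *out, size_t out_len)
{
    size_t len = 0;

    for (; *str; str++) {
        unsigned char c = (unsigned char)*str;

        if (c >= 0x80) {
            continue;   /* part of a multi-byte UTF-8 sequence: skip it */
        }
        if (out_len - len < 2) {
            break;      /* no room for another code unit */
        }
        out[len++] = c; /* low byte */
        out[len++] = 0; /* high byte */
    }
    return len;
}
```

With this, "Schöner" would come out as "Schner": no mojibake, but also easy to overlook.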

@bergzand: preferences?

Quick copy-paste code

// adapted from https://gist.github.com/tylerneylon/9773800
// Decodes one UTF-8 sequence and advances *s past it. Stops at any null characters.
int decode_code_point(const char **s) {
  unsigned char c = (unsigned char)**s;           // Avoid shifting a possibly negative char.
  int k = c ? __builtin_clz(~((unsigned)c << 24)) : 0;  // Count # of leading 1 bits.
  int mask = (1 << (8 - k)) - 1;                  // All 1's with k leading 0's.
  int value = c & mask;
  for (++(*s), --k; k > 0 && **s; --k, ++(*s)) {  // Note that k = #total bytes, or 0.
    value <<= 6;
    value += (**s & 0x3F);
  }
  return value;
}

static size_t _cpy_str_to_utf16(usbus_t *usbus, const char *str)
{
    size_t len = 0;
    uint32_t unichar;

    while (*str) {
        unichar = decode_code_point(&str);

        if (unichar < 0x10000) {
            usbus_control_slicer_put_char(usbus, unichar & 0xff);
            usbus_control_slicer_put_char(usbus, unichar >> 8);
            len += 2;
        } else {
            // following https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF

            uint32_t u = (unichar - 0x10000) & 0xfffff;
            uint16_t w1 = 0xd800 + (u >> 10);
            uint16_t w2 = 0xdc00 + (u & 0x03ff);

            usbus_control_slicer_put_char(usbus, w1 & 0xff);
            usbus_control_slicer_put_char(usbus, w1 >> 8);
            usbus_control_slicer_put_char(usbus, w2 & 0xff);
            usbus_control_slicer_put_char(usbus, w2 >> 8);
            len += 4;
        }
    }
    return len;
}
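As a sanity check of the surrogate arithmetic above, the same split can be isolated into a standalone helper and tried on a host. `utf16_encode` is a hypothetical name introduced here for illustration; it does no validation.

```c
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16 code units.
 * Returns the number of units written (1 or 2). No validation is done
 * (lone surrogates and values above U+10FFFF are not rejected). */
static unsigned utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }

    uint32_t u = (cp - 0x10000) & 0xfffff;

    out[0] = (uint16_t)(0xd800 + (u >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xdc00 + (u & 0x03ff)); /* low surrogate */
    return 2;
}
```

For ö (U+00F6) this gives the single unit 0x00F6, and for 💾 (U+1F4BE) the pair 0xD83D 0xDCBE; on the wire each unit then goes out low byte first, as in the code above.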

maribu added the Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) label May 22, 2023
