Functioning of our UTF16 encoding and endianness [request for discussion] #966
So currently, writing a utf16 file will only write using the current machine's endianness. This is probably not a bug, since I only added the ability to write to utf16 files yesterday. It is similar to how the current utf16 .encode and .decode work: whether we are on a little-endian or big-endian machine, it defaults to the machine's byte order unless it sees a byte order mark (BOM).
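To make the behaviour described above concrete, here is a minimal sketch in Python (not the actual implementation). The function name `decode_utf16` is hypothetical; it only illustrates "honour a leading BOM, otherwise fall back to the machine's byte order".

```python
import sys

BOM_LE = b"\xff\xfe"  # U+FEFF encoded little-endian
BOM_BE = b"\xfe\xff"  # U+FEFF encoded big-endian

def decode_utf16(data: bytes) -> str:
    """Hypothetical helper: honour a leading BOM, else use the machine's byte order."""
    if data.startswith(BOM_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(BOM_BE):
        return data[2:].decode("utf-16-be")
    # No BOM: assume the byte order of the machine we are running on.
    native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
    return data.decode(native)
```

The weak point is the last branch: a BOM-less file written on a machine with the other byte order will decode to the wrong characters.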
This is more problematic when we are doing decodestream because we are not always at the start of the stream. Do we want to set some persistent flag if we see a BOM? What if the person seeks somewhere in the file before reading?
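One possible shape for the "persistent flag" idea, sketched in Python under stated assumptions: the class name `Utf16StreamDecoder` and its `feed` method are hypothetical, and the byte order is fixed once from the first two bytes the decoder is given. It relies on Python's standard `codecs.getincrementaldecoder` to buffer incomplete code units between chunks.

```python
import codecs
import sys

class Utf16StreamDecoder:
    """Hypothetical streaming decoder that fixes the byte order from a leading BOM."""

    def __init__(self):
        self.head = b""      # bytes collected until the first two can be inspected
        self.decoder = None  # chosen once, then kept for the rest of the stream

    def feed(self, chunk: bytes) -> str:
        if self.decoder is not None:
            return self.decoder.decode(chunk)
        self.head += chunk
        if len(self.head) < 2:
            return ""  # not enough bytes yet to look for a BOM
        if self.head[:2] == b"\xff\xfe":
            name, rest = "utf-16-le", self.head[2:]
        elif self.head[:2] == b"\xfe\xff":
            name, rest = "utf-16-be", self.head[2:]
        else:
            # No BOM: fall back to the machine's byte order.
            name = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
            rest = self.head
        self.decoder = codecs.getincrementaldecoder(name)()
        self.head = b""
        return self.decoder.decode(rest)
```

This sketch does not answer the seek question: it treats whatever bytes it is fed first as the start of the stream, so a caller who seeks before reading never shows it the BOM.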
@jnthn what are your thoughts on this? I plan on adding support for explicit utf16le and utf16be as well. That solves many of these issues, though...
People's input is requested.
I think I expect perl6 to respect a BOM. Always (except when explicitly told not to). I spoke to others, and all who cared about BOMs want perl6 to deal with them automatically at the beginning of a stream/file. Most had no idea that a BOM could also occur halfway through a stream (as ZERO WIDTH NO-BREAK SPACE), so they have no idea what to do with them. Unicode says they can be ignored.
@Tux so Wikipedia says: "The standard also allows the byte order to be stated explicitly by specifying UTF-16BE or UTF-16LE as the encoding type. When the byte order is specified explicitly this way, a BOM is specifically not supposed to be prepended to the text, and a U+FEFF at the beginning should be handled as a ZWNBSP character. Most applications ignore a BOM in all cases despite this rule."
So with utf16le or utf16be should we just ignore the BOM or should we pass it through as zero width no-break space?
So far we:
OR we could have utf16 mode as specified above, and utf16le and utf16be modes which either ignore or don't ignore the BOM if present.
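The two candidate behaviours for the explicit utf16le/utf16be modes can be compared with a small Python sketch. The function `decode_explicit` and its `strip_bom` parameter are hypothetical names for illustration only. Python's own `utf-16-le` and `utf-16-be` codecs pass a leading U+FEFF through as a character, which matches the rule quoted from Wikipedia; the flag switches to the "ignore it" behaviour.

```python
def decode_explicit(data: bytes, byte_order: str, strip_bom: bool = False) -> str:
    """Hypothetical helper: decode with an explicit byte order ('le' or 'be')."""
    codec = {"le": "utf-16-le", "be": "utf-16-be"}[byte_order]
    text = data.decode(codec)
    # With an explicit byte order, a leading U+FEFF is a ZWNBSP character
    # unless the caller asks for it to be dropped.
    if strip_bom and text.startswith("\ufeff"):
        text = text[1:]
    return text
```

For the input `b"\xff\xfeh\x00"` with `byte_order="le"`, the default returns `"\ufeffh"`, while `strip_bom=True` returns `"h"`.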