Functioning of our UTF16 encoding and endianess [request for discussion] #966

Closed
samcv opened this Issue Sep 16, 2018 · 4 comments

4 participants

samcv commented Sep 16, 2018

So currently, writing a utf16 file will only write using the current machine's endianness. This is probably not a bug, but I just added the ability to write to utf16 files yesterday. This is similar to how the current utf16 .encode and .decode work: whether we are on a little-endian or a big-endian machine, it defaults to the machine's byte order unless it sees a byte order mark.
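To make the decode default concrete, here is a minimal Python sketch of "use the BOM if there is one, otherwise fall back to the machine's byte order". It is only an illustration, not MoarVM's code, and `decode_utf16` is a hypothetical helper name.

```python
import sys

BOM_LE = b"\xff\xfe"  # U+FEFF serialized little-endian
BOM_BE = b"\xfe\xff"  # U+FEFF serialized big-endian

def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes: honor a leading BOM, else use native byte order."""
    if data.startswith(BOM_LE):
        # BOM selects little-endian and is not passed through
        return data[2:].decode("utf-16-le")
    if data.startswith(BOM_BE):
        # BOM selects big-endian and is not passed through
        return data[2:].decode("utf-16-be")
    # No BOM: fall back to the byte order of the current machine
    native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
    return data.decode(native)
```

With this sketch, the same text decodes identically from either byte order as long as a BOM is present; without one, the result depends on the machine doing the decoding.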

This is more problematic when we are doing decodestream because we are not always at the start of the stream. Do we want to set some persistent flag if we see a BOM? What if the person seeks somewhere in the file before reading?

@jnthn what are your thoughts on this? I plan on adding support for explicit utf16le and utf16be as well. That solves many of these issues, though...

  • If we see a byte order mark at the start of a file, we should remove it even on explicit mode
  • But if we use decodestream, we may not be at the start of the file. Do we just remove any BOM we see anywhere in the file?
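One way to picture the "persistent flag" idea is a streaming decoder that decides the byte order once, from the first two bytes it is fed, and then keeps that decision for every later chunk. The Python sketch below is an assumption about how such a decoder could look, not the decodestream implementation; `Utf16StreamDecoder` and `feed` are hypothetical names. Note that it treats "the first bytes fed" as the start of the stream, so it does not by itself answer the seeking question.

```python
import codecs
import sys

NATIVE = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"

class Utf16StreamDecoder:
    """Chunked UTF-16 decoder that fixes the byte order on the first 2 bytes."""

    def __init__(self):
        self._decoder = None  # chosen once, then reused for all later chunks
        self._head = b""      # bytes buffered until the first 2 are available

    def feed(self, chunk: bytes) -> str:
        if self._decoder is not None:
            # Byte order already decided; a later U+FEFF passes through as text
            return self._decoder.decode(chunk)
        self._head += chunk
        if len(self._head) < 2:
            return ""  # not enough data yet to look for a BOM
        head, self._head = self._head, b""
        if head[:2] == b"\xff\xfe":
            name, head = "utf-16-le", head[2:]
        elif head[:2] == b"\xfe\xff":
            name, head = "utf-16-be", head[2:]
        else:
            name = NATIVE
        # The incremental decoder buffers split code units and surrogate pairs
        self._decoder = codecs.getincrementaldecoder(name)()
        return self._decoder.decode(head)
```

Feeding the bytes of a big-endian file in arbitrary chunk sizes yields the same text as decoding it in one go, and a U+FEFF that shows up after the first two bytes is returned as an ordinary character.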

People's input is requested.

Tux Sep 16, 2018

I think I expect perl6 to respect a BOM. Always (except when explicitly told not to). I spoke to others, and everyone who cared about BOMs wants perl6 to deal with them automatically at the beginning of a stream/file. Most had no idea that a BOM could also occur halfway through a stream (as ZERO WIDTH NO-BREAK SPACE), so they have no idea what to do with them. Unicode says they can be ignored.
Yes, seeking is a problem.

b2gills Sep 16, 2018

There is also the possibility that someone seeks past binary stuff to the start of a utf16 string. In that case they would want the BOM treated as it would be at the beginning of the file.

samcv Sep 16, 2018

Member

@Tux so Wikipedia says: "The standard also allows the byte order to be stated explicitly by specifying UTF-16BE or UTF-16LE as the encoding type. When the byte order is specified explicitly this way, a BOM is specifically not supposed to be prepended to the text, and a U+FEFF at the beginning should be handled as a ZWNBSP character. Most applications ignore a BOM in all cases despite this rule."

So with utf16le or utf16be should we just ignore the BOM or should we pass it through as zero width no-break space?

So far we have:

utf16 mode

  • utf16 mode will detect the BOM if starting from the beginning of the stream and set the byte order accordingly. Anywhere else it will be passed through unchanged.
  • If there is a BOM at the start of the stream in utf16 mode, it will NOT be passed through to perl6 land
  • If there is no BOM we use the current system's byte order

utf16le/be

Either:

  • Pass through BOM if there is one at start of stream as zero width no-break space
    OR
  • Don't pass it through, ignore it as many other applications do

OR we could have utf16 mode as specified above, and utf16le and utf16be modes which either ignore or don't ignore the BOM if present.
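The two candidate behaviours for utf16le/utf16be can be compared side by side with a small Python sketch. The helper `decode_explicit` and its `keep_bom` switch are hypothetical and exist only to show the difference in output; they are not a proposed API.

```python
BOM = "\ufeff"  # U+FEFF, ZERO WIDTH NO-BREAK SPACE when it is not a BOM

def decode_explicit(data: bytes, big_endian: bool, keep_bom: bool) -> str:
    """Decode with an explicitly stated byte order.

    keep_bom=True  -> leading U+FEFF is passed through as a character
    keep_bom=False -> leading U+FEFF is dropped, as many applications do
    """
    text = data.decode("utf-16-be" if big_endian else "utf-16-le")
    if not keep_bom and text.startswith(BOM):
        text = text[1:]
    return text
```

For the input bytes `FF FE 68 00 69 00`, the first option gives a three-character string starting with U+FEFF, while the second gives just "hi".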

Thoughts?

samcv Sep 30, 2018

Member

Going to close this since it's been implemented.

utf16le and utf16be will treat a BOM as a zero width no-break space, while utf16 will use the BOM to set the endianness and not pass it through.
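For readers who want to see this rule in action without a perl6 build, Python's standard codecs happen to follow the same convention, so they can serve as an analogy (this shows Python's behaviour, not MoarVM's implementation):

```python
le_with_bom = b"\xff\xfeh\x00"  # BOM + "h", little-endian
be_with_bom = b"\xfe\xff\x00h"  # BOM + "h", big-endian

# Generic "utf-16": the BOM picks the byte order and is not passed through
assert le_with_bom.decode("utf-16") == "h"
assert be_with_bom.decode("utf-16") == "h"

# Explicit byte order: the leading U+FEFF is kept as an ordinary character
assert le_with_bom.decode("utf-16-le") == "\ufeffh"
assert be_with_bom.decode("utf-16-be") == "\ufeffh"
```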

@samcv samcv closed this Sep 30, 2018
