Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-16 Encoding Issue #86

Open
hut8 opened this issue Nov 20, 2023 · 2 comments
Open

UTF-16 Encoding Issue #86

hut8 opened this issue Nov 20, 2023 · 2 comments

Comments

@hut8
Copy link

hut8 commented Nov 20, 2023

Thanks for a great library! I appreciate your hard work on it.

Like in my last issue (which is a distinct encoding issue), I'm running into more trouble with encoding.

Note that I'm calling SetVersion prior to SetDefaultEncoding because of #85

tags.SetVersion(3)                           // id3v2.3.0
tags.SetDefaultEncoding(id3v2.EncodingUTF16) // UTf-8 would be cool, but 2.3 doesn't support it
tags.SetTitle("Мир! Вашему! Дому! (K-DEF Remix)")

When I run exiftool -v3 -l myfile.mp3, it indicates this for TIT2

  |     009c: 01 fe ff 04 1c 04 38 04 40 00 21 00 20 04 12 04 [......8.@.!. ...]
  |     00ac: 30 04 48 04 35 04 3c 04 43 00 21 00 20 04 14 04 [0.H.5.<.C.!. ...]
  |     00bc: 3e 04 3c 04 43 00 21 00 20 00 28 00 4b 00 2d 00 [>.<.C.!. .(.K.-.]
  |     00cc: 44 00 45 00 46 00 20 00 52 00 65 00 6d 00 69 00 [D.E.F. .R.e.m.i.]
  |     00dc: 78 00 29 00 00 00                               [x.)...]

I see that the first 0x01 indicates that we're dealing with UCS-2 with a BOM. Let's disregard that. Then the BOM says it's big endian (fine). I edited out the 0x01 tag in a text editor then pasted it into an interactive python session:

>>> hexstr="""
... fe ff 04 1c 04 38 04 40 00 21 00 20 04 12 04
... 30 04 48 04 35 04 3c 04 43 00 21 00 20 04 14 04
... 3e 04 3c 04 43 00 21 00 20 00 28 00 4b 00 2d 00
... 44 00 45 00 46 00 20 00 52 00 65 00 6d 00 69 00
... 78 00 29 00 00 00
... """
>>> raw = bytes([int(x, 16) for x in hexstr.replace("\n"," ").split(" ") if x != ""])
>>> raw.decode("utf16")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-16-be' codec can't decode byte 0x00 in position 68: truncated data
>>> len(raw)
69

Kinda hard to get to the standard at the moment (it's been like this for a while):

image

Here's a copy: https://web.archive.org/web/20190207033339/https://id3.org/id3v2.3.0#ID3v2_frame_overview

At least in 2.3.0, it does say that you're supposed to have a null terminator (so, 00 00 for UTF-16/UCS-2). I don't think that you can have a UCS-2 string with an odd number of characters. So it seems like you're appending an additional null character, and the length becomes not a multiple of 2, which makes it not legit UCS-2 or UTF-16.

I'm pasting a screenshot rather than copying the text of the output of id3edit because I think the colors are cool:

image

So that thinks it's bad too. If I just remove the one extra null byte, python seems happy with it (and other tools):

>>> hexstr = """
... fe ff 04 1c 04 38 04 40 00 21 00 20 04 12 04
... 30 04 48 04 35 04 3c 04 43 00 21 00 20 04 14 04
... 3e 04 3c 04 43 00 21 00 20 00 28 00 4b 00 2d 00
... 44 00 45 00 46 00 20 00 52 00 65 00 6d 00 69 00
... 78 00 29 00 00
... """
>>> raw = bytes([int(x, 16) for x in hexstr.replace("\n"," ").split(" ") if x != ""])
>>> raw.decode("utf16")
'Мир! Вашему! Дому! (K-DEF Remix)\x00'

I think this has something to do with other PRs that were meant to fix things for other encodings maybe in 2.4...

@hut8
Copy link
Author

hut8 commented Nov 20, 2023

This looks relevant:

id3v2/encoding.go

Lines 162 to 164 in 34286c4

if to.Equals(EncodingUTF16) && !bytes.HasSuffix([]byte(encoded), []byte{0}) {
bw.WriteByte(0)
}

Also, oddly, in your code,

// bom is used in UTF-16 encoded Unicode with BOM.
// See https://en.wikipedia.org/wiki/Byte_order_mark.
var bom = []byte{0xFF, 0xFE}

This is the little-endian BOM. Not big-endian. But your program seems to write big-endian BOM (and big-endian text).

@hut8
Copy link
Author

hut8 commented Nov 20, 2023

Alright I think I got it:

id3v2/text_frame.go

Lines 24 to 33 in 34286c4

func (tf TextFrame) WriteTo(w io.Writer) (int64, error) {
return useBufWriter(w, func(bw *bufWriter) {
bw.WriteByte(tf.Encoding.Key)
bw.EncodeAndWriteText(tf.Text, tf.Encoding)
// https://github.com/bogem/id3v2/pull/52
// https://github.com/bogem/id3v2/pull/33
bw.Write(tf.Encoding.TerminationBytes)
})
}

When I call tf.WriteTo(), it calls bw.EncodeAndWriteText. Follow along, it will add a single 0 -

func (bw *bufWriter) EncodeAndWriteText(src string, to Encoding) {
	if bw.err != nil {
		return
	}

	bw.err = encodeWriteText(bw, src, to)
}

encodeWriteText does this:

// encodeWriteText encodes src from UTF-8 to "to" encoding and writes to bw.
func encodeWriteText(bw *bufWriter, src string, to Encoding) error {
	if to.Equals(EncodingUTF8) {
		bw.WriteString(src)
		return nil
	}

	toXEncoding := resolveXEncoding(nil, to)
	encoded, err := toXEncoding.NewEncoder().String(src)
	if err != nil {
		return err
	}

	bw.WriteString(encoded)

        // Here we go! 💣
	if to.Equals(EncodingUTF16) && !bytes.HasSuffix([]byte(encoded), []byte{0}) {
		bw.WriteByte(0)
	}

	return nil
}

So at this point, before encodeWriteText is called, it doesn't have the termination characters added. Therefore that if clause: !bytes.HasSuffix([]byte(encoded), []byte{0}) is true. But really that entire if statement is wrong: a single 0 in UTF-16 LE at the end of any ASCII string that's been encoded as UTF-16 will be present. I think this should be removed entirely.

After encodeWriteText is called, now we have a single 0 at the end, which will give us an odd number of bytes (invalid UCS-2/UTF-16). Then back to TextFrame.WriteTo():

func (tf TextFrame) WriteTo(w io.Writer) (int64, error) {
	return useBufWriter(w, func(bw *bufWriter) {
		bw.WriteByte(tf.Encoding.Key)
		bw.EncodeAndWriteText(tf.Text, tf.Encoding) // <- this added a single 0

		// https://github.com/bogem/id3v2/pull/52
		// https://github.com/bogem/id3v2/pull/33
		bw.Write(tf.Encoding.TerminationBytes) // <- now we have two more
	})
}

There you go, three null bytes at the end. So there's a bug with UTF-16, and that's consistent with the tools I'm using to read it.

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant