UTF-16 Encoding Issue #86

hut8 · 2023-11-20T01:24:05Z

Thanks for a great library! I appreciate your hard work on it.

Like in my last issue (which is a distinct encoding issue), I'm running into more trouble with encoding.

Note that I'm calling SetVersion prior to SetDefaultEncoding because of #85

tags.SetVersion(3)                           // id3v2.3.0
tags.SetDefaultEncoding(id3v2.EncodingUTF16) // UTf-8 would be cool, but 2.3 doesn't support it
tags.SetTitle("Мир! Вашему! Дому! (K-DEF Remix)")

When I run exiftool -v3 -l myfile.mp3, it indicates this for TIT2

  |     009c: 01 fe ff 04 1c 04 38 04 40 00 21 00 20 04 12 04 [......8.@.!. ...]
  |     00ac: 30 04 48 04 35 04 3c 04 43 00 21 00 20 04 14 04 [0.H.5.<.C.!. ...]
  |     00bc: 3e 04 3c 04 43 00 21 00 20 00 28 00 4b 00 2d 00 [>.<.C.!. .(.K.-.]
  |     00cc: 44 00 45 00 46 00 20 00 52 00 65 00 6d 00 69 00 [D.E.F. .R.e.m.i.]
  |     00dc: 78 00 29 00 00 00                               [x.)...]

I see that the first 0x01 indicates that we're dealing with UCS-2 with a BOM. Let's disregard that. Then the BOM says it's big endian (fine). I edited out the 0x01 tag in a text editor then pasted it into an interactive python session:

>>> hexstr="""
... fe ff 04 1c 04 38 04 40 00 21 00 20 04 12 04
... 30 04 48 04 35 04 3c 04 43 00 21 00 20 04 14 04
... 3e 04 3c 04 43 00 21 00 20 00 28 00 4b 00 2d 00
... 44 00 45 00 46 00 20 00 52 00 65 00 6d 00 69 00
... 78 00 29 00 00 00
... """
>>> raw = bytes([int(x, 16) for x in hexstr.replace("\n"," ").split(" ") if x != ""])
>>> raw.decode("utf16")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-16-be' codec can't decode byte 0x00 in position 68: truncated data
>>> len(raw)
69

Kinda hard to get to the standard at the moment (it's been like this for a while):

Here's a copy: https://web.archive.org/web/20190207033339/https://id3.org/id3v2.3.0#ID3v2_frame_overview

At least in 2.3.0, it does say that you're supposed to have a null terminator (so, 00 00 for UTF-16/UCS-2). I don't think that you can have a UCS-2 string with an odd number of characters. So it seems like you're appending an additional null character, and the length becomes not a multiple of 2, which makes it not legit UCS-2 or UTF-16.

I'm pasting a screenshot rather than copying the text of the output of id3edit because I think the colors are cool:

So that thinks it's bad too. If I just remove the one extra null byte, python seems happy with it (and other tools):

>>> hexstr = """
... fe ff 04 1c 04 38 04 40 00 21 00 20 04 12 04
... 30 04 48 04 35 04 3c 04 43 00 21 00 20 04 14 04
... 3e 04 3c 04 43 00 21 00 20 00 28 00 4b 00 2d 00
... 44 00 45 00 46 00 20 00 52 00 65 00 6d 00 69 00
... 78 00 29 00 00
... """
>>> raw = bytes([int(x, 16) for x in hexstr.replace("\n"," ").split(" ") if x != ""])
>>> raw.decode("utf16")
'Мир! Вашему! Дому! (K-DEF Remix)\x00'

I think this has something to do with other PRs that were meant to fix things for other encodings maybe in 2.4...

The text was updated successfully, but these errors were encountered:

hut8 · 2023-11-20T01:36:57Z

This looks relevant:

id3v2/encoding.go

Lines 162 to 164 in 34286c4

    
           if to.Equals(EncodingUTF16) && !bytes.HasSuffix([]byte(encoded), []byte{0}) { 
        
           	bw.WriteByte(0) 
        
           }

Also, oddly, in your code,

// bom is used in UTF-16 encoded Unicode with BOM.
// See https://en.wikipedia.org/wiki/Byte_order_mark.
var bom = []byte{0xFF, 0xFE}

This is the little-endian BOM. Not big-endian. But your program seems to write big-endian BOM (and big-endian text).

hut8 · 2023-11-20T01:50:17Z

Alright I think I got it:

id3v2/text_frame.go

Lines 24 to 33 in 34286c4

    
           func (tf TextFrame) WriteTo(w io.Writer) (int64, error) { 
        
           	return useBufWriter(w, func(bw *bufWriter) { 
        
           		bw.WriteByte(tf.Encoding.Key) 
        
           		bw.EncodeAndWriteText(tf.Text, tf.Encoding) 
        
           		// https://github.com/bogem/id3v2/pull/52 
        
           		// https://github.com/bogem/id3v2/pull/33 
        
           		bw.Write(tf.Encoding.TerminationBytes) 
        
           	}) 
        
           }

When I call tf.WriteTo(), it calls bw.EncodeAndWriteText. Follow along, it will add a single 0 -

func (bw *bufWriter) EncodeAndWriteText(src string, to Encoding) {
	if bw.err != nil {
		return
	}

	bw.err = encodeWriteText(bw, src, to)
}

encodeWriteText does this:

// encodeWriteText encodes src from UTF-8 to "to" encoding and writes to bw.
func encodeWriteText(bw *bufWriter, src string, to Encoding) error {
	if to.Equals(EncodingUTF8) {
		bw.WriteString(src)
		return nil
	}

	toXEncoding := resolveXEncoding(nil, to)
	encoded, err := toXEncoding.NewEncoder().String(src)
	if err != nil {
		return err
	}

	bw.WriteString(encoded)

        // Here we go! 💣
	if to.Equals(EncodingUTF16) && !bytes.HasSuffix([]byte(encoded), []byte{0}) {
		bw.WriteByte(0)
	}

	return nil
}

So at this point, before encodeWriteText is called, it doesn't have the termination characters added. Therefore that if clause: !bytes.HasSuffix([]byte(encoded), []byte{0}) is true. But really that entire if statement is wrong: a single 0 in UTF-16 LE at the end of any ASCII string that's been encoded as UTF-16 will be present. I think this should be removed entirely.

After encodeWriteText is called, now we have a single 0 at the end, which will give us an odd number of bytes (invalid UCS-2/UTF-16). Then back to TextFrame.WriteTo():

func (tf TextFrame) WriteTo(w io.Writer) (int64, error) {
	return useBufWriter(w, func(bw *bufWriter) {
		bw.WriteByte(tf.Encoding.Key)
		bw.EncodeAndWriteText(tf.Text, tf.Encoding) // <- this added a single 0

		// https://github.com/bogem/id3v2/pull/52
		// https://github.com/bogem/id3v2/pull/33
		bw.Write(tf.Encoding.TerminationBytes) // <- now we have two more
	})
}

There you go, three null bytes at the end. So there's a bug with UTF-16, and that's consistent with the tools I'm using to read it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UTF-16 Encoding Issue #86

UTF-16 Encoding Issue #86

hut8 commented Nov 20, 2023

hut8 commented Nov 20, 2023 •

edited

hut8 commented Nov 20, 2023 •

edited

UTF-16 Encoding Issue #86

UTF-16 Encoding Issue #86

Comments

hut8 commented Nov 20, 2023

hut8 commented Nov 20, 2023 • edited

hut8 commented Nov 20, 2023 • edited

hut8 commented Nov 20, 2023 •

edited

hut8 commented Nov 20, 2023 •

edited