Problem with CopyTagsToComments #238

paulijar · 2020-03-14T21:16:38Z

I think that the function CopyTagsToComments doesn't work as intended after 96430a1. I believe that the purpose of this function is to enable consistent tag retrieval for the application developers regardless of the tag format (ID3v1/2, APE, ...). Especially on files with both ID3v1 and ID3v2 tags, the function should make arbitration between the tag types and select the best available tag.

After the above mentioned commit, the result of CopyTagsToComments does not arbitrate between ID3v1 and ID3v2. Instead, the comments field will contain data from both tag types if there are any differences. On my tests, the data from ID3v1 always came first and the data from ID3v2 in the second cell. As my application has been expecting to get only single value, it is now using the ID3v1 tag, which is often the worse alternative: every field is truncated to 30 characters, and non-ASCII characters are corrupted, which might actually be another bug. I think that my ID3v1 tags contain properly ISO 8859-1 encoded accented characters like ä, ö, and é, but all of those get converted to some Cyrillic characters.

The text was updated successfully, but these errors were encountered:

JamesHeinrich · 2020-03-14T22:09:09Z

Can you please provide a sample file for testing, either here or to info@getid3.org

paulijar · 2020-03-14T22:24:56Z

I sent you a mail.

JamesHeinrich · 2020-03-15T02:07:55Z

Thank you for the sample file.

The CopyTagsToComments function will copy the best available data from all tag formats and you should be able to pick the first one as the best. If everything works as intended, that is. Normally ID3v1 vs ID3v2 shouldn't be an issue, since if the truncated ID3v1 is a subset of the longer ID3v2 tag value then only the longer (ID3v2 or APE) will be returned and the truncated version ignored. But of course in your example the ID3v1 and ID3v2 values are completely different due to incorrect encoding (falsely) detected for ID3v1, that's why two values are being returned.

The problem lies in the detection of "wrong" ID3v1 encodings in module.tag.id3v1.php lines 68-87. This code block is supposed to detect Cyrillic text in Windows-1251 or KOI8-R text encodings but it's clearly not working as intended in your example. I will need to look at the problem further to see if I can come up with a sensible solution. To a human who is familiar with the language it might be obvious as to whether it should read "Petäjäveräjät" or "Petдjдverдjдt", but there is no real definitive code way of determining the intended encoding automatically. Really there shouldn't be any need to guess, because all ID3v1 tags should be in ISO-8859-1 but there are a vast number of files in the wild that have been tagged according to "local" encoding standards for characters sets not supported by ISO-8859-1 (including Cyrillic).

I will ponder the problem further and post back with what I decide to do.

Prevent ID3v1 encoding from being falsely detected as Windows-1251 #238 Add option encoding_id3v1_autodetect to disable auto-detecting ID3v1 encodings entirely

JamesHeinrich · 2020-03-15T02:40:20Z

Turns out there was actually a design flaw in the ID3v1 encoding autodetection. It should be fixed in 5100f75

Thanks for the bug report and sample file.

paulijar · 2020-03-15T09:36:39Z

Thanks for the quick resolution.

However, there is still a related issue remaining with files tagged with non-Latin scripts. At least when using foobar2000 to tag files with such scripts, what happens, is that the ID3v1 tags will get filled with '?' characters which are used as substitution for those characters not supported in ISO-8859-1. Then after CopyTagsToComments, those replacement strings are returned before the proper content extracted from ID3v2 tags.

Here is an example file, where I entered some Cyrillic, Arabic, and Chinese characters to the tags using foobar2000:
non_latin.zip

CopyTagsToComments should ignore badly-translitterated ID3v1 tags when they appear to be equivalent to the value stored in another tag format that supports multibyte characters. For example, tagging a file with "中文标题" in Foobar2000 will store "????" as the title in ID3v1. #238

JamesHeinrich · 2020-03-15T11:58:14Z

Thanks again for the new sample file. This should be resolved in 4b3c92a

paulijar · 2020-03-15T13:05:32Z

Great, thanks! Works for me now.

As a minor nitpick, the Cyrillic Artist field from the non_latin.mp3 is still shown with two values, second of which is "???????????". For the Chinese and Arabic fields of the file, those ?-filled fields are omitted. But as my application is anyway selecting the first field of each type, it doesn't actually matter in my case that there is some garbage after the correct tag.

#238

JamesHeinrich · 2020-03-15T13:40:04Z

One change included in my above commit is that ID3v1 is forced to be processed last, which makes it easier for me to detect bad transliterations in ID3v1, but as a side effect means that you're more likely to get a "good" variant in the first position of [comments] if for some reason multiple representations are not auto-ignored.

In my testing your sample file the Album (Arabic), Artist (Cyrillic), and Title (Chinese) are all correctly detected and appropriately ignored. I'm not sure why it would be different on your system.
The Cyrillic does appear to contain a non-representable character however, perhaps that's part of the problem. getID3 and MP3tag show it as "ЎВЩЮТӮӴԒԘԑԍ" (note the 3rd-last character). Winamp displays blank for the ID3v2 Artist field (presumably due to the invalid character).

I also made a small change in 9c67941 related to a potential mismatch between comments and comments_html, but that's unrelated to the Cyrillic artist issue.

paulijar · 2020-03-15T14:45:41Z

That "Ԙ" is indeed quite exotic character and maybe it's not universally supported for this reason. According to Wikipedia, it's only used in some non-Slavic minority languages within Russia.

That, however, is not the significant factor here. It turned out, that the question marks are shown on any Cyrillic field if and only if the option getID3->encoding_id3v1_autodetect = true has been given. With the default value false, only the proper Cyrillic data from ID3v2 is shown.

JamesHeinrich · 2020-03-15T15:10:52Z

Ignore my comments about "Ԙ", apparently that character isn't present in fonts that ship with Windows 7, on which I was testing this (including the Wikipedia article). I checked on a Win10 machine and it shows up fine.

JamesHeinrich added a commit that referenced this issue Mar 15, 2020

bugfix #238 Incorrect ID3v1 character encoding detection

5100f75

Prevent ID3v1 encoding from being falsely detected as Windows-1251 #238 Add option encoding_id3v1_autodetect to disable auto-detecting ID3v1 encodings entirely

JamesHeinrich closed this as completed Mar 15, 2020

JamesHeinrich added a commit that referenced this issue Mar 15, 2020

bugfix #238: comments_html may not match comments

9c67941

#238

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Problem with CopyTagsToComments #238

Problem with CopyTagsToComments #238

paulijar commented Mar 14, 2020

JamesHeinrich commented Mar 14, 2020

paulijar commented Mar 14, 2020

JamesHeinrich commented Mar 15, 2020

JamesHeinrich commented Mar 15, 2020

paulijar commented Mar 15, 2020

JamesHeinrich commented Mar 15, 2020

paulijar commented Mar 15, 2020

JamesHeinrich commented Mar 15, 2020 •

edited

Loading

paulijar commented Mar 15, 2020

JamesHeinrich commented Mar 15, 2020

Problem with CopyTagsToComments #238

Problem with CopyTagsToComments #238

Comments

paulijar commented Mar 14, 2020

JamesHeinrich commented Mar 14, 2020

paulijar commented Mar 14, 2020

JamesHeinrich commented Mar 15, 2020

JamesHeinrich commented Mar 15, 2020

paulijar commented Mar 15, 2020

JamesHeinrich commented Mar 15, 2020

paulijar commented Mar 15, 2020

JamesHeinrich commented Mar 15, 2020 • edited Loading

paulijar commented Mar 15, 2020

JamesHeinrich commented Mar 15, 2020

JamesHeinrich commented Mar 15, 2020 •

edited

Loading