Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with CopyTagsToComments #238

Closed
paulijar opened this issue Mar 14, 2020 · 10 comments
Closed

Problem with CopyTagsToComments #238

paulijar opened this issue Mar 14, 2020 · 10 comments

Comments

@paulijar
Copy link
Contributor

I think that the function CopyTagsToComments doesn't work as intended after 96430a1. I believe that the purpose of this function is to enable consistent tag retrieval for the application developers regardless of the tag format (ID3v1/2, APE, ...). Especially on files with both ID3v1 and ID3v2 tags, the function should make arbitration between the tag types and select the best available tag.

After the above mentioned commit, the result of CopyTagsToComments does not arbitrate between ID3v1 and ID3v2. Instead, the comments field will contain data from both tag types if there are any differences. On my tests, the data from ID3v1 always came first and the data from ID3v2 in the second cell. As my application has been expecting to get only single value, it is now using the ID3v1 tag, which is often the worse alternative: every field is truncated to 30 characters, and non-ASCII characters are corrupted, which might actually be another bug. I think that my ID3v1 tags contain properly ISO 8859-1 encoded accented characters like ä, ö, and é, but all of those get converted to some Cyrillic characters.

@JamesHeinrich
Copy link
Owner

Can you please provide a sample file for testing, either here or to info@getid3.org

@paulijar
Copy link
Contributor Author

I sent you a mail.

@JamesHeinrich
Copy link
Owner

Thank you for the sample file.

The CopyTagsToComments function will copy the best available data from all tag formats and you should be able to pick the first one as the best. If everything works as intended, that is. Normally ID3v1 vs ID3v2 shouldn't be an issue, since if the truncated ID3v1 is a subset of the longer ID3v2 tag value then only the longer (ID3v2 or APE) will be returned and the truncated version ignored. But of course in your example the ID3v1 and ID3v2 values are completely different due to incorrect encoding (falsely) detected for ID3v1, that's why two values are being returned.

The problem lies in the detection of "wrong" ID3v1 encodings in module.tag.id3v1.php lines 68-87. This code block is supposed to detect Cyrillic text in Windows-1251 or KOI8-R text encodings but it's clearly not working as intended in your example. I will need to look at the problem further to see if I can come up with a sensible solution. To a human who is familiar with the language it might be obvious as to whether it should read "Petäjäveräjät" or "Petдjдverдjдt", but there is no real definitive code way of determining the intended encoding automatically. Really there shouldn't be any need to guess, because all ID3v1 tags should be in ISO-8859-1 but there are a vast number of files in the wild that have been tagged according to "local" encoding standards for characters sets not supported by ISO-8859-1 (including Cyrillic).

I will ponder the problem further and post back with what I decide to do.

JamesHeinrich added a commit that referenced this issue Mar 15, 2020
Prevent ID3v1 encoding from being falsely detected as Windows-1251
#238
Add option encoding_id3v1_autodetect to disable auto-detecting ID3v1 encodings entirely
@JamesHeinrich
Copy link
Owner

Turns out there was actually a design flaw in the ID3v1 encoding autodetection. It should be fixed in 5100f75

Thanks for the bug report and sample file.

@paulijar
Copy link
Contributor Author

Thanks for the quick resolution.

However, there is still a related issue remaining with files tagged with non-Latin scripts. At least when using foobar2000 to tag files with such scripts, what happens, is that the ID3v1 tags will get filled with '?' characters which are used as substitution for those characters not supported in ISO-8859-1. Then after CopyTagsToComments, those replacement strings are returned before the proper content extracted from ID3v2 tags.

Here is an example file, where I entered some Cyrillic, Arabic, and Chinese characters to the tags using foobar2000:
non_latin.zip

JamesHeinrich added a commit that referenced this issue Mar 15, 2020
CopyTagsToComments should ignore badly-translitterated ID3v1 tags when they appear to be equivalent to the value stored in another tag format that supports multibyte characters. For example, tagging a file with "中文标题" in Foobar2000 will store "????" as the title in ID3v1.
#238
@JamesHeinrich
Copy link
Owner

Thanks again for the new sample file. This should be resolved in 4b3c92a

@paulijar
Copy link
Contributor Author

Great, thanks! Works for me now.

As a minor nitpick, the Cyrillic Artist field from the non_latin.mp3 is still shown with two values, second of which is "???????????". For the Chinese and Arabic fields of the file, those ?-filled fields are omitted. But as my application is anyway selecting the first field of each type, it doesn't actually matter in my case that there is some garbage after the correct tag.

@JamesHeinrich
Copy link
Owner

JamesHeinrich commented Mar 15, 2020

One change included in my above commit is that ID3v1 is forced to be processed last, which makes it easier for me to detect bad transliterations in ID3v1, but as a side effect means that you're more likely to get a "good" variant in the first position of [comments] if for some reason multiple representations are not auto-ignored.

In my testing your sample file the Album (Arabic), Artist (Cyrillic), and Title (Chinese) are all correctly detected and appropriately ignored. I'm not sure why it would be different on your system.
The Cyrillic does appear to contain a non-representable character however, perhaps that's part of the problem. getID3 and MP3tag show it as "ЎВЩЮТӮӴԒԘԑԍ" (note the 3rd-last character). Winamp displays blank for the ID3v2 Artist field (presumably due to the invalid character).

I also made a small change in 9c67941 related to a potential mismatch between comments and comments_html, but that's unrelated to the Cyrillic artist issue.

@paulijar
Copy link
Contributor Author

That "Ԙ" is indeed quite exotic character and maybe it's not universally supported for this reason. According to Wikipedia, it's only used in some non-Slavic minority languages within Russia.

That, however, is not the significant factor here. It turned out, that the question marks are shown on any Cyrillic field if and only if the option getID3->encoding_id3v1_autodetect = true has been given. With the default value false, only the proper Cyrillic data from ID3v2 is shown.

@JamesHeinrich
Copy link
Owner

Ignore my comments about "Ԙ", apparently that character isn't present in fonts that ship with Windows 7, on which I was testing this (including the Wikipedia article). I checked on a Win10 machine and it shows up fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants