Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Wrong display for non-ASCII tags #54

Open
sciurius opened this Issue · 18 comments

3 participants

@sciurius

In my collection, I have an album from Sinéad O'Connor.

In the recently added (and recently played, ...) views, this displays as Sinéad O'Connor.
When the album itself is selected, the artist name displays okay.

@timoreimann
Collaborator

Are you referring to the Android app, the web frontend, or both?

@porkcharsui
Owner

I am seeing something very similar in my collection with Sigur Rós which appears as Sigur Rós in both the web and Android clients. The song result itself does not show this error, just the directory name and artist search results, leading me to believe this is isolated to portions of the backend's use of character encoding.

@timoreimann
Collaborator

portcharsui's assumption seems correct: I tested things myself by creating an album which contains a ó character in both album and artist file and tag names, respectively. Result: Everything looks fine on both the web and Android client.

My Linux box is using UTF-8 for all locale settings, including LANG and LC_ALL. Could anyone affected by the issue please check his locale and see if switching to UTF-8 (possibly by using one of these methods) helps? (Don't forget to re-encode the ID3 portion too. I use eyeD3 for that.)

@timoreimann
Collaborator

Closing the issue as it seems fixed. Please report back if the problem persists in a reproducible way independent of a correctly configured locale environment.

@sciurius

http://www.squirrel.nl/pub/xfer/uploads/3CAcxyTsu3l9UgXeWkLkIoyQ.png

This is the current (as of yesterday) git version.

ID3 data for this track:

id3v1 tag info for 01_4th_And_Vine.mp3:
Title : 4th And Vine Artist: Sinéad O'Connor

Album : How About I Be Me, And You Be Year: , Genre: Unknown (255)
Comment: Track: 1
id3v2 tag info for 01_4th_And_Vine.mp3:
TALB (Album/Movie/Show title): How About I Be Me, And You Be You
TCON (Content type): (255) (255)
TIT2 (Title/songname/content description): 4th And Vine
TPE1 (Lead performer(s)/Soloist(s)): Sinéad O'Connor
TRCK (Track number/Position in set): 1

@timoreimann timoreimann reopened this
@timoreimann timoreimann was assigned
@timoreimann
Collaborator

@sciurius, what OS are you running? If Linux, can you provide the output of the locale command?

@sciurius

I think Fedora17 qualifies as Linux ;)

% cat /etc/redhat-release
Fedora release 17 (Beefy Miracle)
% uname -a
Linux phoenix.squirrel.nl 3.5.0-2.fc17.x86_64 #1 SMP Mon Jul 30 14:48:59 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
% locale
LANG=en_US.utf-8
LC_CTYPE="en_US.utf-8"
LC_NUMERIC="en_US.utf-8"
LC_TIME=en_GB
LC_COLLATE="en_US.utf-8"
LC_MONETARY="en_US.utf-8"
LC_MESSAGES="en_US.utf-8"
LC_PAPER=nl_NL
LC_NAME="en_US.utf-8"
LC_ADDRESS="en_US.utf-8"
LC_TELEPHONE="en_US.utf-8"
LC_MEASUREMENT="en_US.utf-8"
LC_IDENTIFICATION="en_US.utf-8"
LC_ALL=

@timoreimann
Collaborator

Adding one such affected MP3 file to my Supersonic instance, I can confirm that the artist's name looks borked. However, when I edit the MP3 and re-set the artist's name (by means of copying it off Wikipedia and pasting it into the artist field using an MP3 tag editor on my Mac), it looks perfectly right afterwards in Supersonic: All non-ASCII characters are rendered properly.

So, my conclusion is that it is not a Supersonic issue. Apparently, an invalid tag encoding has been applied in the first place.

sciurius, can you try to reproduce what I did and see if it confirms my findings?

@sciurius

Adding one such affected MP3 file to my Supersonic instance, I can
confirm that the artist's name looks borked. However, when I edit the
MP3 and re-set the artist's name (by means of copying it off Wikipedia
and pasting it into the artist field using an MP3 tag editor on my
Mac), it looks perfectly right afterwards in Supersonic: All non-ASCII
characters are rendered properly.

Yes, it is relatively easy to avoid this from happening. But this does
not make the problem (= a bug in the Su*Sonic code) go away. Only the
symptoms disappear.

So, my conclusion is that it is not a Supersonic issue. Apparently, an
invalid tag encoding has been applied in the first place.

Even if this were the case, the same tag is rendered differently in two
places. That indicates a coding error somewhere.

E.g., when I search for "o'connor" I get the following results:

Play Add Download Sinéad O'Connor
Play Add Download Tony O'Connor
Play Add Download Sinéad O'Connor
Play Add Download Yo-Yo Ma, Edgar Meyer, Mark O'Connor

So there are two distinct but visually identical results for Sinéad
O'Connor. The first result has 6 CD's, the second result has 3 CD's.

The same tracks (files) are processed by the Logitech Media Server
(a.k.a. SqueezeBoxServer) and this server has no problems with the tags.

@timoreimann
Collaborator

The same tag isn't rendered differently in Supersonic. Instead, Supersonic uses the directory name for the album name in the breadcrumb navigation, and the tag information for the playlist content. That's why the same label is shown properly one time and improperly a second time. So it's not a bug in Supersonic, just different approaches to achieve the same thing. (Admittedly, one can argue that the implementation should be more consistent.)

I'm not too familiar with Squeeze Box so I cannot tell why (and how) it does better. Maybe it reads the artist's name from another well-encoded field (like album artist) and prefers that one over the broken one. Or it uses the file name. Or, if encoding schemas are stored within MP3 tags along the payload (which I need to figure out), it may decode the content accordingly. I'll need to dig somewhat into the MP3 specs to tell what the best approach for Supersonic would be.

@sciurius
@timoreimann
Collaborator

I agree with what you say regarding not to use directory names for information display. To my defense, I didn't implement it this way, and wouldn't do so. :) I'll open a separate issue later to deal with this. In general, Su*sonic lacks capability to fully leverage tag data which is a shortcoming IMHO.

I took just a rough look at the MP3 specification: Apparently, it does include encoding information, so it should be possible to decode tag data appropriately (unless, of course, for those cases where the encoding is completely broken). We'll use this issue to indicate progress on the problem.

@timoreimann
Collaborator

I dug a little deeper into the Supersonic code: Apparently, it uses the Jaudiotagger library (in a fairly recent version) to decode audio tag data. One interesting bug report (JAUDIOTAGGER-179) deals with the request to allow proper decoding of MP3 tags in local encodings (like code pages). The Jaudiotagger author clearly states that anything but ISO-8859-1, variants of UTF-16, and UTF-8 (depending on the exact MP3v2 version being used) is illegal and won't be supported.

It looks like auto-detecting a local encoding is not possible (at least not in a standard-compliant fashion) whereas the supported encodings identify themselves by means of a marking encoding byte (see Wikipedia). I haven't looked into the Jaudiotagger code but strongly assume that decoding standard-compliant encodings works, so the observed behavior is most likely due to one such illegal local encoding. While the Jaudiotagger author announced to add an option to specify the (local) encoding used when decoding an MP3 tag, it wouldn't be a great solution IMHO because it'd only work for people willing and capable of specifying the employed encoding. Re-writing tag data in a supported encoding seems to be the way to go.

So much to what I figured about Jaudiotagger. Next, I'll try to find out what SqueezeBox Server does.

@sciurius
@timoreimann
Collaborator

Jaudiotagger does not seem to provide any tools. However, putting together a piece of code to read out the raw tag data should be a matter of a couple lines only; the Jaudiotagger code examples may be helpful with that.

Note that UTF-8 is allowed in ID3v2.4 only. It's probably safer to use UTF-16 with ID3v2.3.

@timoreimann
Collaborator

In order to take a closer look at the ID3 data, I build a simple program using Jaudiotagger 2.0.4. It looks like this:

import org.jaudiotagger.audio.AudioFileIO;
import org.jaudiotagger.audio.mp3.MP3File;
import org.jaudiotagger.tag.FieldKey;
import org.jaudiotagger.tag.Tag;
import org.jaudiotagger.tag.id3.*;

import java.io.File;
import java.nio.charset.Charset;


public class ParseMp3 {
    private static String mp3File = "01_4th_And_Vine.mp3";

    public static void main(String[] args) {
        File file = new File(ParseMp3.mp3File);
        MP3File mp3File = null;
        try {
            mp3File = (MP3File)AudioFileIO.read(file);
        } catch (Exception e) {
            e.printStackTrace();
        }

        Tag tag = mp3File.getTag();
        System.out.println("artist: " + tag.getFirst(FieldKey.ARTIST));

        ID3v1Tag id3v1Tag = mp3File.getID3v1Tag();
        System.out.println("ID3v1 artist: " + id3v1Tag.getFirst(FieldKey.ARTIST));

        AbstractID3v2Tag id3v23Tag = mp3File.getID3v2Tag();
        System.out.println("ID3v2.3 artist: " + id3v23Tag.getFirst(ID3v23Frames.FRAME_ID_V3_ARTIST));
        AbstractID3v2Frame id3v23Frame = id3v23Tag.getFirstField(ID3v23Frames.FRAME_ID_V3_ARTIST);
        System.out.println("ID3v2.3 encoding: " + id3v23Frame.getEncoding());

        ID3v24Tag id3v24Tag = mp3File.getID3v2TagAsv24();
        System.out.println("ID3v2.4 artist: " + id3v24Tag.getFirst(ID3v24Frames.FRAME_ID_ARTIST));
        AbstractID3v2Frame id3v24Frame = id3v24Tag.getFirstField(ID3v24Frames.FRAME_ID_ARTIST);
        System.out.println("ID3v2.4 encoding: " + id3v24Frame.getEncoding());

        byte[] rawData = id3v23Frame.getRawContent();
        System.out.println("analyzing " + rawData.length + " bytes of raw ID3v2.3 artist data:");
        String isoText = new String(rawData, Charset.forName("ISO-8859-1"));
        String utfText = new String(rawData, Charset.forName("UTF-8"));
        for (int i = 0; i < rawData.length; ++i) {
            char singleByte = (char)rawData[i];
            char utfChar = '\0';
            try {
                utfChar = utfText.charAt(i);
            } catch (StringIndexOutOfBoundsException exception) {
            }
            System.out.println("ID3v2 artist raw byte #" + String.format("%02d", i+1) + ": " +
                    String.format("%5d", (int)singleByte) + "\t\thex: " +
                    String.format("%8s", Integer.toHexString(singleByte)) + "\t\tISO: " +
                    isoText.charAt(i) + "\t\tUTF: " + utfChar);
        }
    }
}

The program's output is this:

artist: Sinéad O'Connor
ID3v1 artist: Sinéad O'Connor
ID3v2.3 artist: Sinéad O'Connor
ID3v2.3 encoding: ISO-8859-1
ID3v2.4 artist: Sinéad O'Connor
ID3v2.4 encoding: ISO-8859-1
analyzing 27 bytes of raw ID3v2.3 artist data:
ID3v2 artist raw byte #01:    84        hex:       54       ISO: T      UTF: T
ID3v2 artist raw byte #02:    80        hex:       50       ISO: P      UTF: P
ID3v2 artist raw byte #03:    69        hex:       45       ISO: E      UTF: E
ID3v2 artist raw byte #04:    49        hex:       31       ISO: 1      UTF: 1
ID3v2 artist raw byte #05:     0        hex:        0       ISO:        UTF: 
ID3v2 artist raw byte #06:     0        hex:        0       ISO:        UTF: 
ID3v2 artist raw byte #07:     0        hex:        0       ISO:        UTF: 
ID3v2 artist raw byte #08:    17        hex:       11       ISO: �      UTF: �
ID3v2 artist raw byte #09:     0        hex:        0       ISO:        UTF: 
ID3v2 artist raw byte #10:     0        hex:        0       ISO:        UTF: 
ID3v2 artist raw byte #11:     0        hex:        0       ISO:        UTF: 
ID3v2 artist raw byte #12:    83        hex:       53       ISO: S      UTF: S
ID3v2 artist raw byte #13:   105        hex:       69       ISO: i      UTF: i
ID3v2 artist raw byte #14:   110        hex:       6e       ISO: n      UTF: n
ID3v2 artist raw byte #15: 65475        hex:     ffc3       ISO: Ã     UTF: é
ID3v2 artist raw byte #16: 65449        hex:     ffa9       ISO: ©     UTF: a
ID3v2 artist raw byte #17:    97        hex:       61       ISO: a      UTF: d
ID3v2 artist raw byte #18:   100        hex:       64       ISO: d      UTF:  
ID3v2 artist raw byte #19:    32        hex:       20       ISO:        UTF: O
ID3v2 artist raw byte #20:    79        hex:       4f       ISO: O      UTF: '
ID3v2 artist raw byte #21:    39        hex:       27       ISO: '      UTF: C
ID3v2 artist raw byte #22:    67        hex:       43       ISO: C      UTF: o
ID3v2 artist raw byte #23:   111        hex:       6f       ISO: o      UTF: n
ID3v2 artist raw byte #24:   110        hex:       6e       ISO: n      UTF: n
ID3v2 artist raw byte #25:   110        hex:       6e       ISO: n      UTF: o
ID3v2 artist raw byte #26:   111        hex:       6f       ISO: o      UTF: r
ID3v2 artist raw byte #27:   114        hex:       72       ISO: r      UTF: 

Process finished with exit code 0

(I parsed ID3v2.3 data in the example but the results are exactly the same with 2.4.)

According to the ID3v2 frame specification, the bytes outlined above carry the following semantics:

  • bytes 1-4: The frame ID (TPE1, the Lead performer(s)/Soloist(s))
  • bytes 5-8: The frame size (17)
  • byte 9: The character encoding, in this case ISO-8859-1 as per the spec's wording: Frames that allow different types of text encoding have a text encoding description byte directly after the frame size. If ISO-8859-1 is used this byte should be $00, if Unicode is used it should be $01.
  • byte 10-11: flags
  • remaining bytes: payload (the artist text)

As you can see, the artist name was supposedly stored in ISO-8859-1 (byte 00 after the frame size). Taking a closer look at bytes 15 and 16, however, one can observe that in fact UTF-8 was written to the ID3 field: The é character takes up two bytes in UTF-8 which, if mistakenly interpreted as ISO-8859-1, leads to two distinct characters à and ©. (This seems to be a common symptom for UTF-8/ISO-8859-1-related issues.) Another indication for the lack of proper UTF-8 encoding is the fact that the BOM (either FFFE or FEFF) is missing.

So Jaudiotagger seems to be doing everything right. I can't tell for sure whether those other ID3-reading applications you are using are simply lucky to, say, choose your machine's locale setting for decoding, or sophisticated enough to try some kind of encoding detection. While the latter still seems desirable to have in Supersonic for certain cases, I believe that in this one the ID3 encoding is indeed broken.

HTH,

--Timo

@sciurius
@timoreimann
Collaborator

I was curious and gave juniversalchardet a try, a Java port of Mozilla's Universal Charset Detection implementation. Although the project looks a little bit neglected it managed to detect UTF-8 in the malformed files correctly. Since there are so many illegal music files out there, I'd like to incorporate the library in a Supersonic branch and see how it works.

For the sake of completeness, I also tried out another Java alternative called jChardet. It failed to detect UTF-8, however.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.