wrong encoding of national characters in file and folder names + folders disappearing #68

Closed
ppetak opened this issue Aug 21, 2012 · 14 comments


ppetak commented Aug 21, 2012

Hi all,
I have a problem with national characters. I found that other people here have hit the same problem, but none of their solutions worked for me, or the threads were abandoned. My version is 4.7.beta1; I'm on Arch Linux, using the package from the AUR (unsupported) repository. The problem is that:

  1. No folder with national characters in its name is displayed in the left pane, and
  2. although some files with national characters are displayed in the main area, the national characters are replaced by question marks. If the name contains the character ž, the file is NOT displayed at all. There may be more exceptions like this. The web page IS encoded in UTF-8, and the character in it actually IS a question mark.

What I tried: I have checked all the language variables from which Java should learn to use en_US.UTF-8 and display everything normally.

For example, and as proof that my system is set up correctly, here is terminal output. This file and its directory are not displayed at all:

[root@xynns /]# ls /home/public/video/_ceske/Postřižiny/
Postřižiny.avi

However, in this example it is displayed, but with question marks.

[root@xynns /]# ls /home/public/video/_ceske/A\ bude\ hůř.avi 
/home/public/video/_ceske/A bude hůř.avi 

and this is subsonic output:

_ceske » A bude h����
Up | Play all | Play random | Add all | Comment

Of course, such files cannot be played, found easily, etc. The file name is clearly stored in the wrong form in the database, or read wrongly during the indexing phase (which has the same effect).

Now, I have LANG and every LC_* variable set to en_US.UTF-8. As I'm on Arch and start the daemon via a script, I'm also forcing LC_ALL=en_US.UTF-8 in the start script.

Args:

-Xmx100m -Dsubsonic.home=/var/supersonic -Dsubsonic.host=0.0.0.0 -Dsubsonic.port=4141 -Dsubsonic.httpsPort=0 -Dsubsonic.contextPath=/ -Dsubsonic.defaultMusicFolder=/home/public/ -Dsubsonic.defaultPodcastFolder=/var/music/Podcast -Dsubsonic.defaultPlaylistFolder=/var/playlists -Djava.awt.headless=true -verbose -jar supersonic-booter-jar-with-dependencies.jar

It is a very annoying bug, because many files are left out of the index, and I don't know which ones.

@ppetak
Author

ppetak commented Aug 21, 2012

Well, to investigate this issue I have also asked on the Subsonic forum.
-- EDIT: accidentally closed - wrong click :)

@ppetak ppetak closed this as completed Aug 21, 2012
@ppetak ppetak reopened this Aug 21, 2012
@timoreimann
Contributor

The encoding problem is still under active investigation in issue #54. I know that a number of people have complained about encoding problems, so I'm on it whenever time permits (which, unfortunately, is not too often at the moment). I suggest following #54 as well.

This Stackoverflow question describes a case where the page sent from a (XAMPP) web-server was not encoded in UTF-8 (while telling the browser that the encoding is UTF-8), thereby leading to the display of those pesky question marks. I am wondering if something similar happens in Supersonic.

What makes the issue even more complicated is the fact that Supersonic uses a combination of the directory name and tag data to convey information.

What application server are you running? Jetty or something else?

@timoreimann
Contributor

Could you also please provide a link to the Subsonic forum post you mentioned?

@timoreimann
Contributor

Never mind, I think I found your post.

@ppetak
Author

ppetak commented Aug 27, 2012

Sorry, I was away for a week. The post on the Subsonic forum is mine; you found it.
I'm using the built-in server for Subsonic, which is Jetty if I understand correctly. I don't see how the web server problem from the Stack Overflow question you mentioned applies here. Do you think the non-UTF-8 characters are only a display problem in the server-client communication?

What I found is this:

INSERT INTO MEDIA_FILE VALUES(40,'/home/public/video/_ceske/A bude h\ufffd\ufffd\ufffd\ufffd.avi','/home/public','DIRECTORY',NULL,NULL,NULL,'A bude h\ufffd\ufffd\ufffd\ufffd.avi',NULL,NULL,NULL,NULL,NULL,FALSE,NULL,NULL,NULL,NULL,NULL,'/home/public/video/_ceske',0,NULL,NULL,'1970-01-01 01:00:00.000000000','1970-01-01 01:00:00.000000000','2012-08-27 03:00:00.329000000','1970-01-01 01:00:00.000000000',TRUE,1)

This is from the database log. The problem is visible here, so it exists at least from the point where the information is saved to the database. Couldn't it be a problem with reading (or re-coding, or whatever is done with) the directory content?
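
One reading consistent with this log line, though not confirmed anywhere in this thread: the stored name has exactly four U+FFFD replacement characters where the UTF-8 form of the two Czech letters occupies four bytes. That is the pattern that appears when UTF-8 bytes are decoded one byte at a time as US-ASCII. Whether that is what happened here is an assumption. A minimal Java sketch of the effect (the class and method names are hypothetical):

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    // Decodes the UTF-8 bytes of a name as if they were US-ASCII.
    // Every byte above 0x7F is malformed in US-ASCII and becomes U+FFFD.
    static String decodeUtf8BytesAsAscii(String name) {
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.US_ASCII);
    }

    // Renders non-ASCII characters as Unicode escapes, similar to the database log.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The file name from the report, with the two Czech letters escaped.
        String name = "A bude h\u016f\u0159.avi";
        System.out.println(escape(decodeUtf8BytesAsAscii(name)));
    }
}
```

Run as is, this should print A bude h\ufffd\ufffd\ufffd\ufffd.avi, which has the same shape as the logged value. It only shows that the pattern is reproducible, not where in Supersonic or the JVM such a decoding step would occur.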

@timoreimann
Contributor

I am in no way certain that there is some UTF-8-encoding problem within the client/server communication, as I haven't had the time to dive into the issue more deeply. What I consider more likely at this moment (because I spent quite some time on the encoding topic during issue #54) is that Supersonic reads the file/directory name off the disk using a specific encoding (Java uses UTF-8 by default) when it may have been written in some other character set in the first place. File systems (at least POSIX ones) store file/directory names as raw binary data using whatever encoding the writing application chose, so there's no way to determine the right character set for decoding when reading the data. This is unlike MP3's ID3, where encoding markers are used (though not always in a canonical manner -- see #54).

Another possibility: Supersonic's HSQLDB is storing the data in a different encoding.

Could you please try the following: Verify that you're using a UTF-8 encoding, make a copy of one of the affected video files on the shell (using a slightly different name which still contains special characters), and check afterwards if it is indexed and displayed correctly in Supersonic. My theory is that if the file name's encoding was flawed before, creating a copy of the file in a correctly set up (UTF-8) locale should force the shell to write a properly encoded file name.
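
To make the "flawed file name encoding" theory testable, one option is to check whether the raw bytes of a name form well-formed UTF-8. The sketch below is an illustration under assumptions: the class name and the sample byte arrays are made up, and getting the raw name bytes off the disk (for example with a hex dump of the directory listing) is left out. It uses a strict decoder that reports malformed input instead of replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8NameCheck {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] rawName) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(rawName));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Three characters written as UTF-8 (five bytes).
        byte[] utf8 = "h\u016f\u0159".getBytes(StandardCharsets.UTF_8);
        // Illustrative bytes above 0x7F, as a single-byte legacy encoding might produce.
        byte[] legacy = { 0x68, (byte) 0xF9, (byte) 0xF8 };
        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(legacy));
    }
}
```

A name that passes this check was most likely written as UTF-8, in which case the copy-and-reindex experiment would not be expected to change anything.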

I'll try to collect more information on Supersonic's inner mechanics w.r.t. video file indexing/parsing as soon as I can.

Cheers,

--Timo

@timoreimann
Contributor

Addendum: It seems that Java isn't using UTF-8 by default when reading files off the disk; instead, the system encoding is used (source).
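
A quick way to see which encoding a given JVM actually picked up is to print it from inside the process. This is a minimal sketch, not part of Supersonic; the class name is hypothetical. Note that sun.jnu.encoding is an implementation-specific property of OpenJDK/Oracle JVMs and may be absent elsewhere, in which case the line shows null.

```java
import java.nio.charset.Charset;

public class EncodingReport {
    // Collects the encoding-related settings visible to this JVM.
    static String report() {
        return "default charset:  " + Charset.defaultCharset() + "\n"
             + "file.encoding:    " + System.getProperty("file.encoding") + "\n"
             + "sun.jnu.encoding: " + System.getProperty("sun.jnu.encoding") + "\n"
             + "LANG:             " + System.getenv("LANG") + "\n"
             + "LC_ALL:           " + System.getenv("LC_ALL");
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

The values are only meaningful when the program is started the same way as the server process (same user, same start script), because the environment of an interactive shell can differ from that of a daemon.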

I also verified that in case of video files, Supersonic simply uses the file name as title (via the File class). So if anything goes wrong encoding-wise when reading in the file name, it ends up being broken in Supersonic.

@timoreimann
Contributor

I just realized that you said

well, now, I have every LANG and LC_* variables as en_US.UTF-8.

Does that mean you had another locale configured when the file names in question were created?

If so, that'd explain the encoding mismatch and why Java fails to read in the file names correctly. Re-writing all names in UTF-8 should do the trick then.

A lot of speculating on my side. I'll refrain from putting down more theories until I hear from you again. ;)

@ppetak
Author

ppetak commented Aug 29, 2012

Well, I included the terminal output of ls for the example file in my first post. And I wrote that I have LANG and LC_* set to en_US.UTF-8 only because this is the first advice anyone gets on the Subsonic forum. So, to make it clear, my system-wide encoding is en_US.UTF-8 (so all daemons have this encoding set). The user session encoding is not explicitly set, so it also uses en_US.UTF-8. After login on any account:

[root@xynns /]# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

So this information was there to avoid getting advice about encoding settings :)

Anyway, I tried to rename something in shell, with no effect.

What is important in my opinion is the SQL command I found in the log, which clearly has the wrong encoding, so Subsonic is getting the name from the File class in the wrong encoding. So I wrote a short Java program to list the directory contents, like this (shortened):

   Locale loc = Locale.getDefault() ;  
   System.out.println( "default locale: " + loc.toString() ) ;  
   File folder = new File("/home/public/video/_ceske");
 ...
   System.out.println(fileEntry.getName()+fileEntry);
 ...

and everything is OK in terminal output:

[root@xynss /]# java -jar ~/temp/localeTest.jar 
default locale: en_US
....
A bude hůř.avi/home/public/video/_ceske/A bude hůř.avi
....

So it suggests there could be some problem with directory listing in Subsonic: maybe some old self-written function that resets the locale to C, or I don't know what.
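
For anyone who wants to repeat this check, here is a self-contained sketch along the same lines. It is not the original localeTest.jar, whose full source was not posted. The directory is taken from the first command-line argument, and the describe helper is a hypothetical addition that prints each name's code points, so a U+FFFD cannot be hidden by how the terminal renders the text.

```java
import java.io.File;
import java.util.Locale;

public class LocaleTest {
    // Appends the code points of a name, e.g. "h [ U+0068 ]".
    static String describe(String name) {
        StringBuilder sb = new StringBuilder(name).append(" [");
        name.codePoints().forEach(cp -> sb.append(String.format(" U+%04X", cp)));
        return sb.append(" ]").toString();
    }

    public static void main(String[] args) {
        System.out.println("default locale: " + Locale.getDefault());
        File folder = new File(args.length > 0 ? args[0] : ".");
        File[] entries = folder.listFiles();
        if (entries == null) {
            System.err.println("not a readable directory: " + folder);
            return;
        }
        for (File entry : entries) {
            System.out.println(describe(entry.getName()));
        }
    }
}
```

A name that was decoded correctly shows code points such as U+016F and U+0159 for the Czech letters, while a damaged one shows U+FFFD. Running it from the same start script as the daemon, rather than from an interactive shell, would rule out a difference in environment between the two.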

I have done all I can up to this point to find the problem. The only next step available to me is debugging Subsonic on my machine...

@timoreimann
Contributor

Thanks for the added information. I guess your directory listing sample code kind of beats my "the filename's encoding must be messed up" argument.

I stepped through the relevant code myself yesterday. The method responsible for reading the metadata is MetaDataParser.getMetaData(). Since video files don't contain ID3-like tag information, the video title is guessed by a call to guessTitle(). Its implementation boils down to a call to removeTrackNumberFromTitle() which tries to strip any track number from the title first. The title, in essence, is provided like this:

FilenameUtils.getBaseName(file.getPath())

(line 129)

where file is of type File. So nothing particularly magic here as far as I can see.
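
To illustrate why this step is unlikely to introduce or repair encoding damage, here is a simplified plain-Java stand-in for the base-name extraction. It is a hypothetical approximation, not the Commons IO implementation, and it assumes '/' as the only path separator. Since it only cuts substrings, any replacement characters already present in the path pass through unchanged.

```java
public class TitleGuess {
    // Simplified stand-in: drops the directories and the last extension.
    static String baseName(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        return dot < 0 ? name : name.substring(0, dot);
    }

    public static void main(String[] args) {
        // A correctly decoded path and one that already contains U+FFFD.
        String good = "/home/public/video/_ceske/A bude h\u016f\u0159.avi";
        String damaged = "/home/public/video/_ceske/A bude h\uFFFD\uFFFD\uFFFD\uFFFD.avi";
        System.out.println(baseName(good).length());
        System.out.println(baseName(damaged).length());
    }
}
```

The two printed lengths differ by two, because the damaged title still carries four replacement characters in place of two letters. This suggests the damage would have to happen before this point, when the name is first turned into a Java string.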

For what it's worth, my Supersonic instance is having no issues with your filenames. I am using an all-UTF-8 system environment as well.

@ppetak
Author

ppetak commented Aug 29, 2012

Well, OK, I cloned the repository and I will try to build and debug it here. I don't have any deep Java experience; in fact, the encoding test program was one of my first Java programs ;) But I had already found the getMetaData method myself before you wrote, so I will give it a try.

@timoreimann
Contributor

My post was meant as documentation for what I managed to figure out so far; I did not intend to push you into debugging things yourself. Sorry if I sounded anything like that.

If you feel like trying though, I'm the last person to hold you back. :) For additional debugging-related questions, feel free to use our developers group.

@ppetak
Author

ppetak commented Aug 29, 2012

Well, the worst case happened. I debugged it step by step with many problematic files, and everything works. I didn't change any configuration on the server; I only added debug parameters to the java command line so that I could connect NetBeans, since Supersonic runs on a headless server.
So I don't know where the problem was, or how to solve it for other people... Even after disconnecting the debugger, cleaning the database, and re-indexing the files, everything works. I must add again that I have not changed any configuration on the server...
For me this is now closed; at least I have learned how to remotely debug a Java application. :) I plan to use Supersonic on our server for 5-10 users, so maybe I will find something else to poke :)

Thanks for all your help.

@ppetak ppetak closed this as completed Aug 29, 2012
@timoreimann
Contributor

You didn't upgrade Supersonic for debugging purposes (e.g., by checking out a more recent GitHub master copy), which might have fixed some issues, did you?

Unless you did, I guess the best explanation is that you have just encountered your first Heisenbug! Congratulations, we've all been down that road. :)

If you ever happen to discover what caused the encoding troubles in the first place please let me know!
