fixing the display of gibberish instead of hebrew in file names and ID3 tags #222

Closed
eliargon opened this issue Dec 27, 2019 · 4 comments

Comments

@eliargon

I have a library of Israeli music folders on my Windows PC. Both the file names and the ID3 tags embedded in the mp3 files use Hebrew characters. I have used MediaMonkey to manage my music and tag the files, and now I am trying to do it myself with PHP.
Using the getID3 library, I ran the demo.browse.php file on one of the folders and got the following display:
[screenshot]
Fixing it required the following:
1- Replacing 'ISO-8859-1' with 'UTF-8' on lines 84, 116, 217, and 238.
Now the file names are displayed correctly.
The next task was replacing the gibberish artist and title with the equivalent Hebrew characters:
äçìåðåú äâáåäéí àäáä øàùåðä
should be displayed as
החלונות הגבוהים אהבה ראשונה
After trying a number of conversion routines without success, I finally created a simple translation scheme with two arrays:

$gib2 = array('/à/','/á/','/â/','/ã/','/ä/',
    '/å/','/æ/','/ç/','/è/','/é/','/ë/','/ì/',
    '/î/','/ð/','/ñ/','/ò/','/ô/','/ö/','/÷/',
    '/ø/','/ù/','/ú/','/ê/','/í/','/ï/','/ó/','/õ/');
$heb = array('א','ב','ג','ד','ה','ו','ז','ח','ט','י','כ','ל','מ','נ','ס','ע','פ','צ','ק','ר','ש','ת','ך','ם','ן','ף','ץ');

Passing the string to preg_replace() with these two arrays emits the correct Hebrew characters:

$str='äçìåðåú äâáåäéí | àäáä øàùåðä';
echo preg_replace($gib2, $heb, $str);
// displays: החלונות הגבוהים | אהבה ראשונה
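
The character pairs in the two arrays line up with Windows-1255 bytes that were read as ISO-8859-1 (for example, byte 0xE0 is 'à' in ISO-8859-1 and 'א' in Windows-1255). If that is what happened here, the same result might be obtained by reversing the wrong decoding with iconv() instead of listing every letter. This is only a sketch under that assumption; the function name fixHebrewMojibake is hypothetical and is not part of getID3.

```php
<?php
// Hypothetical helper (not part of getID3): repairs text whose original
// Windows-1255 bytes were wrongly decoded as ISO-8859-1 and stored as UTF-8.
function fixHebrewMojibake(string $text): string
{
    // Step 1: turn each Latin-1 character back into its original single byte.
    // @ hides the notice iconv raises when the text is not Latin-1 at all.
    $bytes = @iconv('UTF-8', 'ISO-8859-1', $text);
    if ($bytes === false) {
        return $text; // nothing to repair, leave the input untouched
    }

    // Step 2: decode those bytes with the code page they were written in.
    $fixed = @iconv('CP1255', 'UTF-8', $bytes);
    return $fixed === false ? $text : $fixed;
}

echo fixHebrewMojibake('äçìåðåú äâáåäéí | àäáä øàùåðä'), PHP_EOL;
```

ASCII characters such as the space and '|' are the same in both code pages, so they pass through unchanged. Unlike the array approach, this also covers Windows-1255 characters that are not in the $gib2 list.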

Well, not quite. The program HTML-encodes the foreign characters, so they are stored as
&#[0-9]+; entities. Those need to be converted to UTF-8 first.
I found a piece of code on php.net:

<?php
/* Entity crap. */
$input = "Fovi&#269;";

$output = preg_replace_callback("/(&#[0-9]+;)/", function($m) { return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES"); }, $input);

/* Plain UTF-8. */
echo $output;
?>
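
As a possible alternative to the callback above, PHP's built-in html_entity_decode() also turns numeric entities into UTF-8 characters, and it avoids the "HTML-ENTITIES" pseudo-encoding of mbstring, which is deprecated as of PHP 8.2. A minimal sketch using the same sample input:

```php
<?php
// Same sample input as the php.net snippet.
$input = 'Fovi&#269;';

// Decode numeric (and named) entities straight to UTF-8.
$output = html_entity_decode($input, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $output, PHP_EOL;
```

Note that this also decodes named entities such as &amp;amp;, whereas the regular expression above only touches decimal &#NNN; entities. Whether that difference matters depends on what else is in the tag text.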

To make it all work in demo.browse.php, replace lines 279 and 280 with the following lines:

// added lines
$artist = preg_replace($gib2, $heb, preg_replace_callback("/(&#[0-9]+;)/", function($m) { return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES"); }, @implode('', $fileinfo['comments_html']['artist'])));
$title = preg_replace($gib2, $heb, preg_replace_callback("/(&#[0-9]+;)/", function($m) { return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES"); }, @implode('', $fileinfo['comments_html']['title'])));
// modified original lines
echo '<td align="left">&nbsp;'.(isset($fileinfo['comments_html']['artist']) ? $artist : ((isset($fileinfo['video']['resolution_x']) && isset($fileinfo['video']['resolution_y'])) ? $fileinfo['video']['resolution_x'].'x'.$fileinfo['video']['resolution_y'] : '')).'</td>';
echo '<td align="left">&nbsp;'.(isset($fileinfo['comments_html']['title']) ? $title : (isset($fileinfo['video']['frame_rate']) ? number_format($fileinfo['video']['frame_rate'], 3).'fps' : '')).'</td>';

Finally, add the two arrays $gib2 and $heb at the beginning of the file, just below the
$PageEncoding = 'UTF-8'; line.
Now the display is correct:
[screenshot]

That's it. If someone can do it more efficiently, I would love to hear about it.

P.S. In case anyone wonders, 'אהבה ראשונה' means "first love", and you can listen to the song on YouTube.

Eli Argon

@eliargon changed the title from "fixing the display of gibberish instead on hebrew in file names and ID3 tags" to "fixing the display of gibberish instead of hebrew in file names and ID3 tags" Dec 27, 2019
JamesHeinrich added a commit that referenced this issue Dec 27, 2019
#222
Default filesystem character encoding to UTF-8, except for Windows and
PHP<7.1 which defaults to ISO-8859-1 (but the user may want/need to
change this to their local system character encoding setting)
@JamesHeinrich
Owner

It sounds from your description that you have Windows set to Windows-1255 character encoding which remaps the upper characters (171-254) to Hebrew characters.

As I understand it, Windows uses UTF-16 encoding for filenames, but this is mapped into an 8-bit encoding for "non-UTF8-aware" programs, which includes PHP < 7.1.0. This can make it difficult (or impossible) to work with files containing characters that are "foreign" to the system codepage, since they can't be mapped. For example, in my Windows7-PHPv7.0.24 test environment I have a file named CP-1255 החלונות הגבוהים אהבה ראשונה.mp3, but the PHP filesystem functions only see it as CP-1255 ??????? ??????? ???? ??????.mp3 because Hebrew characters are not present in Windows-1252.

I have made changes in 35d752f (and e08fbde) that set the default filesystem encoding to UTF-8 in most cases, except for (Windows + PHP<7.1), which defaults to Windows-1252. Users like @eliargon may want/need to change that to the encoding that matches their system, e.g. Windows-1255. I don't think it's possible to detect this automatically.
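
To illustrate the rule described above, here is a rough sketch of how such a default could be chosen. It is not the code from the referenced commits; the function name guessFilesystemEncoding and its parameters are hypothetical and exist only to make the rule easy to read and test.

```php
<?php
// Hypothetical sketch of the default described above (not getID3's code):
// UTF-8 everywhere, except Windows with PHP older than 7.1.
function guessFilesystemEncoding(string $os = PHP_OS, int $phpVersionId = PHP_VERSION_ID): string
{
    // PHP_OS starts with "WIN" on Windows (e.g. "WINNT").
    // Checking only the prefix avoids a false match on "Darwin".
    $isWindows = strtoupper(substr($os, 0, 3)) === 'WIN';

    if ($isWindows && $phpVersionId < 70100) {
        // Assumed fallback; a Hebrew system would need 'Windows-1255' instead.
        return 'Windows-1252';
    }
    return 'UTF-8';
}

echo guessFilesystemEncoding(), PHP_EOL;
```

The fallback still has to be overridden by hand on systems whose codepage is not Windows-1252, since the function only looks at the OS name and the PHP version.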

The better solution is to upgrade your PHP installation to at least v7.1 which supports UTF-8 representation of filenames and should eliminate all this kind of confusion.

@eliargon
Author

Hi James, thanks for the quick and detailed response.
I am running your code on Windows NT DESKTOP-IB8I3RS 10.0 build 18362 (Windows 10) AMD64, using the XAMPP platform with PHP 7.3.7. I will check the rest of your comments later and hope to find a more elegant solution. Keep up the good work.

@eliargon
Author

Just checked it. Your new commit fixed the display of the directory (Files in ......) and the directory listing of file names.
Still, the text in the 'Artist' and 'Title' columns remains gibberish, so the use of
preg_replace($gib2,$heb, preg_replace_callback(... is still needed for a correct display.

@eliargon
Author

Well, I have to eat my words. I checked your code with a newer album, and the text in the tags displays both in Hebrew (correctly!) and in another language that looks like Russian.
[screenshot]
