Wrong accented characters in Portuguese Brazilian subtitles #375

Closed
andreoliwa opened this issue Apr 29, 2014 · 25 comments

Comments

@andreoliwa

Hello,

First of all, thanks for Subliminal. ;)

Some accented characters are wrong in pt-BR subtitles.
I've noticed this encoding problem today for the second time.

Some examples:

  • O quЖ? (should be "O quê?");
  • Eu disse que sз queria falar. (should be "Eu disse que só queria falar.");
  • Legal nж? (should be "Legal né?");
  • ╔ eu vi. Agora vр sentar! (should be "É eu vi. Agora vá sentar!").

I'm not sure if this is a problem with a specific .srt file, or if it happens with all of them.
I manually downloaded one pt-BR subtitle of this movie from OpenSubtitles.org, and the accented characters are right.

I tried the verbose mode, but it doesn't show the full URL of the subtitle being downloaded.
I was planning to check directly against OpenSubtitles.org, to see if there are wrong subtitles among the 10 existing ones.

Here is the verbose output:

$ subliminal . -l en pt-BR --color -v 
INFO     [subliminal.video] Scanning directory u'/mnt/Movies/All/Oldboy [2003]'
INFO     [subliminal.video] Scanning video u'Oldboy [2003].avi' in u'/mnt/Movies/All/Oldboy [2003]'
INFO     [subliminal.api] Skipping provider 'tvsubtitles': no video to search for
INFO     [subliminal.api] Skipping provider 'addic7ed': no video to search for
INFO     [subliminal.api] Skipping provider 'bierdopje': no language to search for
INFO     [subliminal.api] Skipping provider 'thesubdb': no language to search for
INFO     [subliminal.api] Listing subtitles with provider 'opensubtitles' for video <Movie [u'Oldboy', 2003]> with languages set([<Language [pt-BR]>])
INFO     [subliminal.api] Found 10 subtitles
INFO     [subliminal.api] Listing subtitles with provider 'podnapisi' for video <Movie [u'Oldboy', 2003]> with languages set([<Language [pt-BR]>])
INFO     [subliminal.api] Found 4 subtitles
INFO     [subliminal.subtitle] Computed score 31 with matches set([u'title', u'hash', u'year'])
INFO     [subliminal.subtitle] Computed score 31 with matches set([u'title', u'hash', u'year'])
INFO     [subliminal.subtitle] Computed score 31 with matches set([u'title', u'hash', u'year'])
INFO     [subliminal.subtitle] Computed score 31 with matches set([u'hash', u'year'])
INFO     [subliminal.subtitle] Computed score 31 with matches set([u'hash', u'year'])
INFO     [subliminal.subtitle] Computed score 31 with matches set([u'title', u'hash', u'year'])
INFO     [subliminal.subtitle] Computed score 31 with matches set([u'title', u'hash', u'year'])
INFO     [subliminal.subtitle] Computed score 31 with matches set([u'title', u'hash', u'year'])
INFO     [subliminal.subtitle] Computed score 31 with matches set([u'title', u'hash', u'year'])
INFO     [subliminal.subtitle] Computed score 31 with matches set([u'hash', u'year'])
INFO     [subliminal.subtitle] Computed score 20 with matches set([u'year', u'title'])
INFO     [subliminal.subtitle] Computed score 20 with matches set([u'year', u'title'])
INFO     [subliminal.subtitle] Computed score 20 with matches set([u'year', u'title'])
INFO     [subliminal.subtitle] Computed score 20 with matches set([u'year', u'title'])
INFO     [subliminal.api] Downloading subtitle <OpenSubtitlesSubtitle [pt-BR]> with score 31 into u'/mnt/Movies/All/Oldboy [2003]/Oldboy [2003].pt.srt'
1 subtitle downloaded

This is my locale configuration:

$ export | grep LC_
declare -x LC_ADDRESS="pt_BR.UTF-8"
declare -x LC_IDENTIFICATION="pt_BR.UTF-8"
declare -x LC_MEASUREMENT="pt_BR.UTF-8"
declare -x LC_MONETARY="pt_BR.UTF-8"
declare -x LC_NAME="pt_BR.UTF-8"
declare -x LC_NUMERIC="pt_BR.UTF-8"
declare -x LC_PAPER="pt_BR.UTF-8"
declare -x LC_TELEPHONE="pt_BR.UTF-8"
declare -x LC_TIME="pt_BR.UTF-8"

If there is some other information I could provide to make it easier to find the problem, please let me know.

Thanks in advance.

@fadeldamen

same problem

@geeanlooca

This also happens with Italian subtitles on every accented letter.
I tried to run subliminal with the --compatibility parameter, but in version 0.7.4 on Windows this parameter does not exist. Also, the -c parameter specifies the cache file, not the compatibility mode as stated on the subliminal website.

Thank you, keep up the good work

@cgelici

cgelici commented May 20, 2014

Yeah, same for Turkish subs. I use FileBot for those. It has a '--encoding utf8' option which fixes the subs.

@ivanlao

ivanlao commented Nov 6, 2014

Exactly the same problem with Spanish subtitles.

  • ї"En serio" quй?
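One way to pin down which wrong codec produced a garbled line like this is to brute-force it. The sketch below assumes the original file was Latin-1 (the thread does not confirm this for this particular file), re-decodes the expected text with a handful of candidate codecs, and reports which ones reproduce the garbled output. The helper name `find_wrong_codecs` and the candidate list are illustrative only.

```python
def find_wrong_codecs(expected, garbled, source_encoding="latin-1",
                      candidates=("cp1251", "cp1255", "cp855",
                                  "cp866", "koi8_r", "iso8859_5")):
    """Return the candidate codecs that turn `expected` into `garbled`.

    Assumption: the file was really stored as `source_encoding` and
    was mistakenly decoded with one of `candidates`.
    """
    raw = expected.encode(source_encoding)
    matches = []
    for codec in candidates:
        try:
            decoded = raw.decode(codec)
        except UnicodeDecodeError:
            continue  # this codec cannot decode these bytes at all
        if decoded == garbled:
            matches.append(codec)
    return matches


# Escapes are used so the example does not depend on the editor's encoding.
expected = '\u00bf"En serio" qu\u00e9?'   # the intended Spanish line
garbled = '\u0457"En serio" qu\u0439?'    # the garbled line quoted above
matches = find_wrong_codecs(expected, garbled)
print(matches)
```

Under these assumptions the only match is a Cyrillic code page, which would point at the encoding detection step rather than at the subtitle file itself.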

@ioExpander

I'm having the same issue with French subtitle files, which get corrupted when subliminal 0.7.5 converts them to UTF-8.
I managed to reproduce the issue with chardet 2.3.0: chardet identifies my file as Windows-1255 when it is actually Latin-1 (confirmed using vim). Converting to UTF-8 on that basis gives the same scrambled accents as the version downloaded by subliminal.

Is there an option to force subliminal not to change the encoding of downloaded files?
It might be a quick fix, as my TV can read the original file without any issues.

@ioExpander

Just had a look at the code on the master branch: the issue might have been fixed. I will give it a try.

Edit: just confirmed my issue is fixed in the upcoming 0.8 release. Thanks :)

@miigotu
Contributor

miigotu commented Jul 9, 2015

@Diaoul maybe use codecs.open instead of open? It should auto-detect the encoding if you do not specify one.

@Diaoul
Owner

Diaoul commented Jul 9, 2015

I doubt that as it is not mentioned in the docs. Where did you get that information?

@miigotu
Contributor

miigotu commented Jul 9, 2015

From: https://docs.python.org/2/howto/unicode.html#reading-and-writing-unicode-data
encoding is a string giving the encoding to use; if it’s left as None, a regular Python file object that accepts 8-bit strings is returned. Otherwise, a wrapper object is returned, and data written to or read from the wrapper object will be converted as needed.

@Diaoul
Owner

Diaoul commented Jul 9, 2015

You're misinterpreting that sentence. It won't autodetect anything: it just returns bytes if no encoding is specified, and a string otherwise. Here is a quick example:

>>> import codecs
>>> open('test.srt', 'wb').write('ééé'.encode('latin-1'))  # a file with latin-1 encoding
>>> codecs.open('test.srt').read()
b'\xe9\xe9\xe9'  # plain bytes here, no auto detection
>>> codecs.open('test.srt', encoding='latin-1').read()
'ééé'  # now we have string

@miigotu
Contributor

miigotu commented Jul 9, 2015

It is supposed to be bytes, use print:

>>> open('test.srt', 'wb').write('ééé')
>>> codecs.open('test.srt').read()
'\xc3\xa9\xc3\xa9\xc3\xa9'
>>> print codecs.open('test.srt').read()
ééé
>>>

@Diaoul
Owner

Diaoul commented Jul 9, 2015

So if it's supposed to be bytes, where is the encoding detection here? You are writing bytes and reading bytes, nothing crazy here. I suggest you practice a little more with encodings, and use Python 3 to do it, because Unicode and Python 2 don't really get along and you might get wrong ideas, like thinking you're manipulating strings when you're actually manipulating bytes.

Here it is with Python 3:

>>> open('test.srt', 'wb').write('ééé')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface
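For reference, a minimal Python 3 sketch of the point being made: bytes written in 'wb' mode come back unchanged in 'rb' mode, and nothing guesses an encoding until you call decode yourself. The temporary directory and file name are illustrative.

```python
import os
import tempfile

# Three Latin-1 encoded "é" characters, as in the example above.
raw = "\u00e9\u00e9\u00e9".encode("latin-1")

path = os.path.join(tempfile.mkdtemp(), "test.srt")
with open(path, "wb") as f:   # binary mode accepts bytes only
    f.write(raw)

with open(path, "rb") as f:   # binary mode returns bytes, no decoding
    data = f.read()

# Decoding is an explicit step, and the result depends on the codec named.
as_latin1 = data.decode("latin-1")
try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False           # these bytes are not valid UTF-8

print(data, as_latin1, utf8_ok)
```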

@miigotu
Contributor

miigotu commented Jul 10, 2015

You don't need to encode or decode anything, because you are downloading the file as bytes into a variable, and you want to write the file directly from that variable without any conversion at all. This eliminates any possible encode/decode issues altogether.

Python 3 breaks that approach, I see, but there must be a way to avoid the conversion like in Python 2.

'wb' should be taking bytes, not strings.

In Python 3.1 this would work.

@miigotu
Contributor

miigotu commented Jul 10, 2015

@Diaoul codecs.open just handles it properly.

>>> open('test.srt', 'wb').write('ééé'.encode('latin-1'))
3
>>> open('test.srt').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 0: invalid continuation byte
>>> codecs.open('test.srt').read()
b'\xe9\xe9\xe9'

Open the srt in something other than a terminal (gedit here) and it will not display correctly as UTF-8, but it does if you let it switch to Latin-1.

@miigotu
Contributor

miigotu commented Jul 10, 2015

So maybe I'm getting twisted around here =P

@Diaoul
Owner

Diaoul commented Jul 10, 2015

There's already an option for that: https://github.com/Diaoul/subliminal/blob/master/subliminal/cli.py#L62-L63

Subliminal is not only about saving subtitles to a file: one might want to edit the subtitles, save only as UTF-8, search in the content, or who knows what. Language gives a good indication of which encoding is used, so I provide that guessed_encoding property: https://github.com/Diaoul/subliminal/blob/master/subliminal/subtitle.py#L38-L74

You rely on your media player to guess the file encoding, and luckily for you it works. Some other players refuse to play anything not encoded as UTF-8.
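To illustrate the idea of using the language as an encoding hint, here is a minimal sketch. It is not the code behind the linked guessed_encoding property: the table `LEGACY_ENCODINGS` and the functions `guess_encoding` and `to_utf8` are hypothetical names, and the language-to-codec pairs are only plausible examples. The approach is to try strict UTF-8 first, then the legacy encodings usual for the language, and finally Latin-1, which accepts any byte sequence.

```python
# Hypothetical table: legacy encodings to try per ISO 639-3 language code.
LEGACY_ENCODINGS = {
    "rus": ["cp1251"],
    "heb": ["cp1255"],
    "tur": ["cp1254"],
    "pol": ["cp1250"],
}


def guess_encoding(content, language):
    """Return the first candidate encoding that decodes `content` strictly."""
    candidates = ["utf-8"] + LEGACY_ENCODINGS.get(language, []) + ["latin-1"]
    for encoding in candidates:
        try:
            content.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None  # not reached in practice: latin-1 decodes any bytes


def to_utf8(content, language):
    """Re-encode subtitle bytes as UTF-8 using the language-based guess."""
    return content.decode(guess_encoding(content, language)).encode("utf-8")


portuguese = "S\u00f3 queria falar".encode("latin-1")
print(guess_encoding(portuguese, "por"))
print(to_utf8(portuguese, "por"))
```

A guess made this way can still be wrong (any single-byte codec will "succeed" on most input), which is why a language hint matters more than whether decoding raised an error.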

@miigotu
Contributor

miigotu commented Jul 10, 2015

You know more about it than I do, just trying to help since we use subliminal in our app xD

@cosad3s

cosad3s commented Jul 17, 2015

Same problem for me with French subtitles. Details:

On my Synology DS214Play (DSM 5.2-5565), some subtitles are written with the wrong encoding.
For example, in French: The.Shawshank.Redemption.1994.720p.BluRay.x264-SiNNERS.srt.

  1. Downloaded by Subliminal with the OpenSubtitles provider: "ANSI as UTF8" (according to Notepad++)
  2. Downloaded directly from OpenSubtitles: "ANSI" (according to Notepad++)

Differences (hex):

GOOD ENCODING (direct dl from OpenSubtitles), extract
00000060: 3A 30 32 3A 31 31 2C 36 30 30 0D 0A 76 6F 74 72 :02:11,600..votr
00000070: 65 20 64 69 73 70 75 74 65 20 61 76 65 63 20 76 e dispute avec v
00000080: 6F 74 72 65 20 66 65 6D 6D 65 2C 0D 0A 6C 61 20 otre femme,..la
00000090: 6E 75 69 74 20 6F F9 20 65 6C 6C 65 20 61 20 E9 nuit où elle a é
000000A0: 74 E9 20 74 75 E9 65 2E 0D 0A 0D 0A 33 0D 0A 30 té tuée.....3..0

BAD ENCODING (from Subliminal with OpenSubtitles provider), extract
00000060: 3A 30 32 3A 31 31 2C 36 30 30 0D 0A 76 6F 74 72 :02:11,600..votr
00000070: 65 20 64 69 73 70 75 74 65 20 61 76 65 63 20 76 e dispute avec v
00000080: 6F 74 72 65 20 66 65 6D 6D 65 2C 0D 0A 6C 61 20 otre femme,..la
00000090: 6E 75 69 74 20 6F D1 89 20 65 6C 6C 65 20 61 20 nuit oщ elle a
000000A0: D0 B6 74 D0 B6 20 74 75 D0 B6 65 2E 0D 0A 0D 0A жtж tuжe.....

Also, it is not related to Synology: SynoCommunity/spksrc#1697 (comment)

Is it caused by Python 2.7.3?
Is there a direct solution (which .py file should I modify)?
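The two hex extracts are consistent with the original Latin-1 bytes having been decoded with a Cyrillic DOS code page and then re-encoded as UTF-8. The sketch below assumes that code page is cp855, which is an inference from the bytes shown and not something stated in this thread. It reproduces the bad bytes from the good ones and, under the same assumption, reverses the damage.

```python
# Bytes from the GOOD extract: "nuit où elle a été tuée." in Latin-1.
good = bytes.fromhex(
    "6E 75 69 74 20 6F F9 20 65 6C 6C 65 20 61 20 E9"
    "74 E9 20 74 75 E9 65 2E"
)

# Corruption: decode with the wrong codec (assumed cp855), save as UTF-8.
bad = good.decode("cp855").encode("utf-8")
print(bad.hex(" ").upper())

# Repair: undo both steps in reverse order, then decode as Latin-1.
repaired = bad.decode("utf-8").encode("cp855").decode("latin-1")
print(repaired)
```

The repair only works when the wrong codec is known and maps every byte reversibly, so it is a recovery trick for already-damaged files and not a substitute for fixing the detection.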

@miigotu
Contributor

miigotu commented Jul 17, 2015

@Cosades what codepage is your syno using?

@miigotu
Contributor

miigotu commented Jul 17, 2015

@Diaoul A lot of encoding errors in SR were fixed when I updated chardet.
The old version was guessing a Thai encoding with a 99% score for pt_BR/pt_PT instead of UTF-8.
The new one guesses UTF-8 with a 50.5% score.

I know you only use it as a last resort, but maybe it helps point in the right direction.

@cosad3s

cosad3s commented Jul 17, 2015

@miigotu: locale:
LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=en_US.utf8

My HMI is in French...
So, should I understand that Subliminal makes its guesses based on the locale?
If not, what are the settings for the best guess?

Subliminal could be more flexible, with advanced settings for users who want to tune input/output subtitles. Settings for the guesses could be an interesting option, and they would move some variables out of the code...

@miigotu
Contributor

miigotu commented Jul 17, 2015

It's not subliminal that makes the assumption based on locale, Python does. That's why I was discussing codecs.open with Diaoul before.
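To make the locale dependence concrete: in Python 3, text-mode open() without an encoding argument typically falls back to the locale's preferred encoding (unless UTF-8 mode is enabled), so the same script can behave differently on two machines. A minimal sketch, using an illustrative temporary file, of inspecting that default and bypassing it with an explicit encoding:

```python
import locale
import os
import tempfile

# What text-mode open() would typically use when no encoding is given.
default_encoding = locale.getpreferredencoding(False)
print("text-mode default:", default_encoding)

path = os.path.join(tempfile.mkdtemp(), "sample.srt")

# Naming the encoding on both sides removes the dependence on the locale.
with open(path, "w", encoding="utf-8") as f:
    f.write("O qu\u00ea?\n")
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

print(text)
```

This only covers reading and writing text files. It does not explain a wrong guess made on downloaded bytes, which happens before any file is opened in text mode.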

@miigotu
Contributor

miigotu commented Jul 17, 2015

@Cosades I'm curious which provider this is, and whether it happens only with opensubtitles or only with the other ones. The other 4 use requests; opensubtitles does not. It might help pinpoint the issue if it is limited to either opensubtitles or non-opensubtitles.

@Diaoul podnapisi line 95 (and in other providers):
r.content returns an ascii result string (pre-encoded???)
r.text returns the unicode result string (or pre-decoded???)
Does that make a difference?

@Diaoul
Owner

Diaoul commented Jul 18, 2015

Yes, content is bytes and text is a string decoded with the encoding guessed by requests.
However, requests guesses the encoding from HTTP headers and HTML meta tags. There are none of these for subtitle downloads, so it gives pretty bad results.

If you don't care about the subtitle file encoding and want to leave the encoding guess to your media center, save the subtitle.content property rather than subtitle.text, as bytes (open the file in wb mode).

@Diaoul Diaoul removed this from the 1.0 milestone Jul 22, 2015
@Diaoul Diaoul modified the milestones: 1.1, 1.0 Jul 22, 2015
@Diaoul Diaoul removed this from the 1.1 milestone Sep 6, 2015
@Diaoul
Owner

Diaoul commented Oct 30, 2015

Closing in favor of #528

@Diaoul Diaoul closed this as completed Oct 30, 2015