Wrong accented characters in Portuguese Brazilian subtitles #375

andreoliwa · 2014-04-29T12:25:30Z

Hello,

First of all, thanks for Subliminal. ;)

Some accented characters are wrong in pt-BR subtitles.
I've noticed this encoding problem today for the second time.

Some examples:

O quЖ? (should be "O quê?");
Eu disse que sз queria falar. (should be "Eu disse que só queria falar.");
Legal nж? (should be "Legal né?");
╔ eu vi. Agora vр sentar! (should be "É eu vi. Agora vá sentar!").
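
For reference, all four garbled lines above come out exactly this way when Latin-1 bytes are decoded with the Cyrillic DOS code page cp855. The snippet below is only a sketch under that assumption: nothing in this report confirms which encoding was actually guessed, and `garble` is a hypothetical helper name.

```python
# Sketch: reproduce the garbling by decoding Latin-1 bytes with the wrong codec.
# Assumption: the source file was Latin-1 and the wrong guess was cp855.

EXAMPLES = [
    "O quê?",
    "Eu disse que só queria falar.",
    "Legal né?",
    "É eu vi. Agora vá sentar!",
]

def garble(text, real="latin-1", guessed="cp855"):
    """Encode with the real encoding, then decode with the wrongly guessed one."""
    return text.encode(real).decode(guessed)

# Each entry should match the corresponding broken line quoted above.
garbled = [garble(line) for line in EXAMPLES]
```

The ASCII letters survive because both code pages agree on them; only bytes above 0x7F change meaning.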

I'm not sure if this is a problem with a specific .srt file, or if it happens with all of them.
I manually downloaded one pt-BR subtitle of this movie from OpenSubtitles.org, and the accented characters are right.

I tried the verbose mode, but it doesn't show the full URL of the subtitle being downloaded.
I was planning to check directly against OpenSubtitles.org, to see if there are wrong subtitles among the 10 existing ones.

Here is the verbose output:

$ subliminal . -l en pt-BR --color -v 
INFO     [subliminal.video] Scanning directory u'/mnt/Movies/All/Oldboy [2003]'
INFO     [subliminal.video] Scanning video u'Oldboy [2003].avi' in u'/mnt/Movies/All/Oldboy [2003]'
INFO     [subliminal.api] Skipping provider 'tvsubtitles': no video to search for
INFO     [subliminal.api] Skipping provider 'addic7ed': no video to search for
INFO     [subliminal.api] Skipping provider 'bierdopje': no language to search for
INFO     [subliminal.api] Skipping provider 'thesubdb': no language to search for
INFO     [subliminal.api] Listing subtitles with provider 'opensubtitles' for video <Movie [u'Oldboy', 2003]> with languages set([<Language [pt-BR]>])
INFO     [subliminal.api] Found 10 subtitles
INFO     [subliminal.api] Listing subtitles with provider 'podnapisi' for video <Movie [u'Oldboy', 2003]> with languages set([<Language [pt-BR]>])
INFO     [subliminal.api] Found 4 subtitles
INFO     [subliminal.subtitle] Computed score 31 with matches set([u'title', u'hash', u'year'])
INFO     [subliminal.subtitle] Computed score 31 with matches set([u'title', u'hash', u'year'])
INFO     [subliminal.subtitle] Computed score 31 with matches set([u'title', u'hash', u'year'])
INFO     [subliminal.subtitle] Computed score 31 with matches set([u'hash', u'year'])
INFO     [subliminal.subtitle] Computed score 31 with matches set([u'hash', u'year'])
INFO     [subliminal.subtitle] Computed score 31 with matches set([u'title', u'hash', u'year'])
INFO     [subliminal.subtitle] Computed score 31 with matches set([u'title', u'hash', u'year'])
INFO     [subliminal.subtitle] Computed score 31 with matches set([u'title', u'hash', u'year'])
INFO     [subliminal.subtitle] Computed score 31 with matches set([u'title', u'hash', u'year'])
INFO     [subliminal.subtitle] Computed score 31 with matches set([u'hash', u'year'])
INFO     [subliminal.subtitle] Computed score 20 with matches set([u'year', u'title'])
INFO     [subliminal.subtitle] Computed score 20 with matches set([u'year', u'title'])
INFO     [subliminal.subtitle] Computed score 20 with matches set([u'year', u'title'])
INFO     [subliminal.subtitle] Computed score 20 with matches set([u'year', u'title'])
INFO     [subliminal.api] Downloading subtitle <OpenSubtitlesSubtitle [pt-BR]> with score 31 into u'/mnt/Movies/All/Oldboy [2003]/Oldboy [2003].pt.srt'
1 subtitle downloaded

This is my locale configuration:

$ export | grep LC_
declare -x LC_ADDRESS="pt_BR.UTF-8"
declare -x LC_IDENTIFICATION="pt_BR.UTF-8"
declare -x LC_MEASUREMENT="pt_BR.UTF-8"
declare -x LC_MONETARY="pt_BR.UTF-8"
declare -x LC_NAME="pt_BR.UTF-8"
declare -x LC_NUMERIC="pt_BR.UTF-8"
declare -x LC_PAPER="pt_BR.UTF-8"
declare -x LC_TELEPHONE="pt_BR.UTF-8"
declare -x LC_TIME="pt_BR.UTF-8"

If there is some other information I could provide to make it easier to find the problem, please let me know.

Thanks in advance.

fadeldamen · 2014-05-04T12:34:18Z

same problem

geeanlooca · 2014-05-10T00:11:29Z

This also happens with Italian subtitles on every accented letter.
I tried to run subliminal with the --compatibility parameter, but in version 0.7.4 on Windows this parameter does not exist. Also, the -c parameter specifies the cache file, not the compatibility mode as stated on the subliminal website.

Thank you, keep up the good work

cgelici · 2014-05-20T22:13:32Z

Yeah, same for Turkish subs. I use filebot for those. It has an '--encoding utf8' option which fixes the subs.

ivanlao · 2014-11-06T19:41:19Z

Exactly the same problem with Spanish subtitles.

ї"En serio" quй?
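
This sample looks like Latin-1 text that was decoded as windows-1251. A hedged sketch of undoing that kind of damage, assuming the wrong decode was lossless (every byte mapped to some character); `repair` is a hypothetical helper, not something subliminal provides.

```python
def repair(garbled, guessed="cp1251", real="latin-1"):
    """Reverse a wrong decode: re-encode with the wrong codec, decode with the real one."""
    return garbled.encode(guessed).decode(real)

# The broken line quoted above, turned back into readable Spanish.
fixed = repair('ї"En serio" quй?')
```

This only works when the wrongly guessed encoding is known and the file has not been edited since; otherwise re-downloading the original is safer.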

ioExpander · 2015-03-22T19:36:47Z

I'm having the same issue with French subtitle files, which get corrupted when subliminal 0.7.5 converts them to UTF-8.
Actually, I managed to reproduce the issue with chardet 2.3.0: chardet identifies my file as Windows-1255 when it should be Latin-1 (confirmed using vim). When converting to UTF-8 I get the same scrambled accents as in the version downloaded by subliminal.
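
A minimal sketch of why the detection result matters so much: the same bytes transcoded to UTF-8 under the two encodings named above give different text. The sample string and the `to_utf8` helper are made up for illustration, and chardet itself is not called here.

```python
def to_utf8(raw, source_encoding):
    """Transcode raw subtitle bytes to UTF-8, trusting the given source encoding."""
    return raw.decode(source_encoding, errors="replace").encode("utf-8")

raw = "été".encode("latin-1")      # b'\xe9t\xe9', stands in for the downloaded file
good = to_utf8(raw, "latin-1")     # accents survive the conversion
bad = to_utf8(raw, "cp1255")       # each accented byte becomes a Hebrew letter
```

Both results are valid UTF-8, so nothing fails at conversion time; the damage only shows up when someone reads the text.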

Is there an option to force subliminal not to change the encoding of downloaded files?
It might be a quick fix, as my TV can read the original file without any issues.

ioExpander · 2015-03-23T07:10:58Z

Just had a look at the code of the master branch : the issue might have been fixed. I will give it a try.

Edit: just confirmed my issue has been fixed with the upcoming 0.8 release. Thanks :)

miigotu · 2015-07-09T12:22:52Z

@Diaoul maybe use codecs.open instead of open ?? It should auto detect the encoding if you do not specify.

Diaoul · 2015-07-09T14:50:04Z

I doubt that as it is not mentioned in the docs. Where did you get that information?

miigotu · 2015-07-09T21:39:20Z

From: https://docs.python.org/2/howto/unicode.html#reading-and-writing-unicode-data
encoding is a string giving the encoding to use; if it’s left as None, a regular Python file object that accepts 8-bit strings is returned. Otherwise, a wrapper object is returned, and data written to or read from the wrapper object will be converted as needed.

Diaoul · 2015-07-09T22:01:33Z

You're mistaken in your interpretation of that sentence: it won't autodetect anything. It will just return bytes if no encoding is specified, and a string otherwise. Here is a quick example:

>>> import codecs
>>> open('test.srt', 'wb').write('ééé'.encode('latin-1'))  # a file with latin-1 encoding
>>> codecs.open('test.srt').read()
b'\xe9\xe9\xe9'  # plain bytes here, no auto detection
>>> codecs.open('test.srt', encoding='latin-1').read()
'ééé'  # now we have string

miigotu · 2015-07-09T22:22:36Z

It is supposed to be bytes, use print:

>>> open('test.srt', 'wb').write('ééé')
>>> codecs.open('test.srt').read()
'\xc3\xa9\xc3\xa9\xc3\xa9'
>>> print codecs.open('test.srt').read()
ééé
>>>

Diaoul · 2015-07-09T23:50:45Z

So if it's supposed to be bytes where is the encoding detection here? You are writing bytes and reading bytes, nothing crazy here. I suggest you practice a little bit more with encoding and use python3 to do it because unicode and python2 don't really get along and you might get wrong ideas like thinking you're manipulating strings while actually you're just manipulating bytes.

Here is with python3:

>>> open('test.srt', 'wb').write('ééé')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface
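
To make the bytes/str distinction concrete in Python 3, here is a small sketch using the same sample string as above. It only uses built-in `str.encode` and `bytes.decode`; the variable names are arbitrary.

```python
text = "ééé"

latin1_bytes = text.encode("latin-1")   # 3 bytes, one per character
utf8_bytes = text.encode("utf-8")       # 6 bytes, two per character

# Bytes carry no encoding label, so decoding with the wrong codec either fails...
try:
    latin1_bytes.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# ...or silently produces different characters.
misread = utf8_bytes.decode("latin-1")
```

Nothing in either byte string says which encoding produced it, which is why the reader has to be told or has to guess.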

miigotu · 2015-07-10T02:18:03Z

You don't need to be encoding or decoding anything, because you are downloading the file as bytes into a variable, and you want to write the file directly from the variable without any conversion at all. This eliminates any possible encode/decode conversion issues altogether.

Python 3 screws that method up I see, but there must be a way to avoid the conversion like in Python2

'wb' should be taking bytes, not strings.

Python 3.1 this would work.

miigotu · 2015-07-10T02:32:32Z

@Diaoul codecs.open just handles it properly.

>>> open('test.srt', 'wb').write('ééé'.encode('latin-1'))
3
>>> open('test.srt').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 0: invalid continuation byte
>>> codecs.open('test.srt').read()
b'\xe9\xe9\xe9'

Open the srt in something other than terminal (gedit here) and it will not display right in utf-8, but it does if you let it switch to displaying as latin-1.

miigotu · 2015-07-10T02:34:42Z

So maybe I'm getting twisted around here =P

Diaoul · 2015-07-10T07:56:06Z

There's already an option for that: https://github.com/Diaoul/subliminal/blob/master/subliminal/cli.py#L62-L63

Subliminal is not only about saving subtitles to file, one might want to edit the subtitles, save only as utf-8, search in content or I don't know what. Language gives a good indication on which encoding is used so I provide that guessed_encoding property: https://github.com/Diaoul/subliminal/blob/master/subliminal/subtitle.py#L38-L74

You rely on your media player to guess the file encoding, and luckily for you it works. Some other players refuse to play anything that is not encoded as UTF-8.
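
A rough sketch of the idea of guessing from the language: try UTF-8 first, then a short list of encodings typical for that language. This is not subliminal's actual guessed_encoding code; the function name, the language table, and the `None` fallback are all assumptions made for illustration.

```python
# Hypothetical table: encodings commonly seen for a few languages.
CANDIDATES = {
    "pt": ["cp1252"],
    "fr": ["cp1252"],
    "es": ["cp1252"],
    "it": ["cp1252"],
    "tr": ["cp1254"],
    "ru": ["cp1251", "koi8_r"],
}

def guess_encoding(content, language):
    """Return the first candidate encoding that decodes content without error."""
    for encoding in ["utf-8"] + CANDIDATES.get(language, []):
        try:
            content.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None  # caller decides: keep the raw bytes or fall back to a detector
```

Trying UTF-8 first is reasonable because non-UTF-8 text with accented characters rarely decodes as valid UTF-8, while most single-byte code pages accept almost any input.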

miigotu · 2015-07-10T09:05:43Z

You know more about it than I do, just trying to help since we use subliminal in our app xD

cosad3s · 2015-07-17T21:20:43Z

Same problem for me with French subtitles. Details :

On my Synology DS214Play (DSM 5.2-5565), some subtitles are written with bad encoding.
For example, in French --> The.Shawshank.Redemption.1994.720p.BluRay.x264-SiNNERS.srt.

Download from Subliminal with OpenSubtitles provider --> ANSI as UTF8 (said by NotePad++)
Direct download from OpenSubtitles --> ANSI (said by NotePad++)

Differences (hex) :

GOOD ENCODING (direct dl from OpenSubtitles), extract
00000060: 3A 30 32 3A 31 31 2C 36 30 30 0D 0A 76 6F 74 72 :02:11,600..votr
00000070: 65 20 64 69 73 70 75 74 65 20 61 76 65 63 20 76 e dispute avec v
00000080: 6F 74 72 65 20 66 65 6D 6D 65 2C 0D 0A 6C 61 20 otre femme,..la
00000090: 6E 75 69 74 20 6F F9 20 65 6C 6C 65 20 61 20 E9 nuit où elle a é
000000A0: 74 E9 20 74 75 E9 65 2E 0D 0A 0D 0A 33 0D 0A 30 té tuée.....3..0

BAD ENCODING (from Subliminal with OpenSubtitles provider), extract
00000060: 3A 30 32 3A 31 31 2C 36 30 30 0D 0A 76 6F 74 72 :02:11,600..votr
00000070: 65 20 64 69 73 70 75 74 65 20 61 76 65 63 20 76 e dispute avec v
00000080: 6F 74 72 65 20 66 65 6D 6D 65 2C 0D 0A 6C 61 20 otre femme,..la
00000090: 6E 75 69 74 20 6F D1 89 20 65 6C 6C 65 20 61 20 nuit oÑ‰ elle a
000000A0: D0 B6 74 D0 B6 20 74 75 D0 B6 65 2E 0D 0A 0D 0A Ð¶tÐ¶ tuÐ¶e.....
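
The two dumps give enough information to test which decoding would produce the bad bytes: F9 became D1 89 and E9 became D0 B6. A hedged sketch of that check follows; the shortlist of codecs is an arbitrary assumption, not an exhaustive search, and `matching_codecs` is a hypothetical helper.

```python
# Original byte -> UTF-8 bytes found at the same position in the bad file.
PAIRS = {b"\xf9": b"\xd1\x89", b"\xe9": b"\xd0\xb6"}

# Assumption: a hand-picked shortlist of single-byte codecs to try.
SHORTLIST = ["latin-1", "cp1252", "cp1251", "cp1255",
             "cp855", "cp866", "koi8_r", "iso8859_5"]

def matching_codecs(pairs, codecs):
    """Return codecs that decode each original byte to the character seen in the bad file."""
    matches = []
    for name in codecs:
        try:
            if all(src.decode(name) == bad.decode("utf-8")
                   for src, bad in pairs.items()):
                matches.append(name)
        except UnicodeDecodeError:
            continue  # this codec has no mapping for one of the bytes
    return matches

result = matching_codecs(PAIRS, SHORTLIST)
```

With this shortlist, cp855 matches both byte pairs, which would mean the Latin-1 file was treated as Cyrillic before being re-saved as UTF-8.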

It is not related to Synology: SynoCommunity/spksrc#1697 (comment)

Is it caused by Python 2.7.3?
Is there a direct solution (which .py file should I modify)?

miigotu · 2015-07-17T21:29:38Z

@Cosades what codepage is your syno using?

miigotu · 2015-07-17T21:33:11Z

@Diaoul A lot of encoding errors in SR were fixed when I updated chardet.
It was guessing a thai encoding for pt_BR/pt_PT instead of utf-8 with 99% score
The new one guesses utf-8 with 50.5% score.

I know you only use it as a last resort, but maybe it helps point in the right direction.

cosad3s · 2015-07-17T21:39:28Z

@miigotu : locale :
LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=en_US.utf8

My HMI is in French...
So should I understand that Subliminal makes its guesses based on the locale?
If not, what are the settings for the best guess?

Subliminal could be more adaptable, with advanced settings for users who want to tune input/output subtitles. Settings for the guesses could be an interesting option, and they would move some variables out of the code...

miigotu · 2015-07-17T21:45:03Z

It's not subliminal that makes the assumption based on locale, Python does. That's why I was discussing codecs.open with Diaoul before.

miigotu · 2015-07-17T22:19:07Z

@Cosades curious which provider this is, and if it is only either opensubtitles or only the other ones. The other 4 use requests, opensubtitles does not. Might help pinpoint an issue if it is limited to either opensubtitles or non-opensubtitles.

@Diaoul podnapisi line 95 (and in other providers):
r.content returns an ascii result string (pre-encoded???)
r.text returns the unicode result string (or pre-decoded???)
Does that make a difference?

Diaoul · 2015-07-18T14:32:31Z

Yes, content is bytes and text is a string decoded with the encoding guessed by requests.
However, requests guesses the encoding from HTTP headers and HTML meta tags. Subtitle downloads have neither, so it gives pretty bad results.

If you don't care about the subtitle file encoding and want to leave the encoding guess to your media center, save the subtitle.content property as bytes (open the file in wb mode) rather than subtitle.text.
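
A minimal sketch of that approach, with a plain bytes value standing in for subtitle.content. The `save_raw` helper, the sample text, and the file name are assumptions for illustration only.

```python
import os
import tempfile

def save_raw(content, path):
    """Write downloaded subtitle bytes to disk unchanged (no decode/encode step)."""
    with open(path, "wb") as f:
        f.write(content)

# Stand-in for subtitle.content: Latin-1 bytes as they came from the provider.
content = "la nuit où elle a été tuée.".encode("latin-1")
path = os.path.join(tempfile.mkdtemp(), "movie.fr.srt")
save_raw(content, path)
```

The file on disk is then byte-for-byte what the provider sent, and any encoding guess is left to the player.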

Diaoul · 2015-10-30T09:34:09Z

Closing in favor of #528

gilgamezh mentioned this issue Oct 26, 2014

Some times the subtitles have a bad encoding touchandgo-devs/touchandgo#30

Closed

Diaoul added type/bug source/encoding labels Jul 5, 2015

Diaoul modified the milestone: 1.0 Jul 6, 2015

Diaoul removed this from the 1.0 milestone Jul 22, 2015

Diaoul modified the milestones: 1.1, 1.0 Jul 22, 2015

Diaoul removed this from the 1.1 milestone Sep 6, 2015

Diaoul closed this as completed Oct 30, 2015
