Wrong accented characters in Portuguese Brazilian subtitles #375
Comments
same problem |
This also happens with Italian subtitles on every accented letter. Thank you, keep up the good work |
Yeah, same for Turkish subs. I use filebot for those. It has a '--encoding utf8' option which fixes the subs. |
Exactly the same problem with Spanish subtitles. |
I'm having the same issue with French subtitle files that get corrupted when converted to UTF8 by subliminal 0.7.5. Is there an option to force subliminal not to change the encoding of downloaded files? |
Just had a look at the code of the master branch: the issue might have been fixed. I will give it a try. Edit: just confirmed my issue has been fixed with the upcoming 0.8 release. Thanks :) |
@Diaoul maybe use codecs.open instead of open? It should auto-detect the encoding if you do not specify one. |
I doubt that as it is not mentioned in the docs. Where did you get that information? |
From: https://docs.python.org/2/howto/unicode.html#reading-and-writing-unicode-data |
You're mistaken in your interpretation of that sentence: it won't autodetect anything, it will just return bytes if no encoding is specified, string otherwise. Here is a quick example:
>>> import codecs
>>> open('test.srt', 'wb').write('ééé'.encode('latin-1')) # a file with latin-1 encoding
>>> codecs.open('test.srt').read()
b'\xe9\xe9\xe9' # plain bytes here, no auto detection
>>> codecs.open('test.srt', encoding='latin-1').read()
'ééé' # now we have string |
It is supposed to be bytes, use print:
>>> open('test.srt', 'wb').write('ééé')
>>> codecs.open('test.srt').read()
'\xc3\xa9\xc3\xa9\xc3\xa9'
>>> print codecs.open('test.srt').read()
ééé
>>> |
So if it's supposed to be bytes, where is the encoding detection here? You are writing bytes and reading bytes, nothing crazy here. I suggest you practice a little bit more with encoding and use python3 to do it, because unicode and python2 don't really get along and you might get wrong ideas, like thinking you're manipulating strings while actually you're just manipulating bytes. Here it is with python3:
>>> open('test.srt', 'wb').write('ééé')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface |
You don't need to be encoding or decoding anything, because you are downloading the file as bytes into a variable, and you want to write the file directly from the variable without any conversion at all. This eliminates any possible encode/decode conversion issues altogether. Python 3 breaks that method, I see, but there must be a way to avoid the conversion, as in Python 2: 'wb' should be taking bytes, not strings. In Python 3.1 this would work. |
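A minimal sketch of the "write the bytes as they came" idea from the comment above, assuming Python 3. The helper name `save_raw_subtitle` and the stand-in payload are hypothetical and are not part of subliminal. The point is only that a file opened with 'wb' accepts `bytes`, so no decode or encode step is involved as long as the downloaded content is never turned into a `str`.

```python
import os
import tempfile


def save_raw_subtitle(content: bytes, path: str) -> None:
    """Write downloaded subtitle bytes to disk without decoding or re-encoding."""
    if not isinstance(content, bytes):
        # Refuse str so that no implicit conversion can sneak in.
        raise TypeError("content must be bytes; encode str explicitly first")
    with open(path, "wb") as handle:
        handle.write(content)


# Stand-in for a downloaded payload (latin-1 encoded, like the thread's example).
payload = "ééé".encode("latin-1")
target = os.path.join(tempfile.mkdtemp(), "test.srt")
save_raw_subtitle(payload, target)

# Reading back in binary mode returns exactly the bytes that were written.
with open(target, "rb") as handle:
    round_tripped = handle.read()
```

The file keeps whatever encoding the provider used, so a player or editor still has to guess it when opening the file.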
@Diaoul codecs.open just handles it properly.
>>> open('test.srt', 'wb').write('ééé'.encode('latin-1'))
3
>>> open('test.srt').read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/codecs.py", line 313, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 0: invalid continuation byte
>>> codecs.open('test.srt').read()
b'\xe9\xe9\xe9'
Open the srt in something other than terminal (gedit here) and it will not display right in utf-8, but it does if you let it switch to displaying as latin-1. |
So maybe I'm getting twisted around here =P |
There's already an option for that: https://github.com/Diaoul/subliminal/blob/master/subliminal/cli.py#L62-L63 Subliminal is not only about saving subtitles to file: one might want to edit the subtitles, save only as utf-8, search in the content, or I don't know what. Language gives a good indication of which encoding is used, so I provide that guessed_encoding property: https://github.com/Diaoul/subliminal/blob/master/subliminal/subtitle.py#L38-L74 You rely on your media player for guessing the file encoding, and luckily for you it works. Some other players refuse to play anything encoded differently than utf-8. |
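A rough sketch of how a language hint can drive encoding guessing, as the comment above describes. This is an illustration under stated assumptions, not the code behind subliminal's guessed_encoding property: the `LANGUAGE_ENCODINGS` table, the function names, and the final latin-1 fallback are all hypothetical choices made for the example.

```python
# Hypothetical table: legacy encodings commonly seen for a few languages.
LANGUAGE_ENCODINGS = {
    "pt": ["cp1252"],
    "fr": ["cp1252"],
    "it": ["cp1252"],
    "es": ["cp1252"],
    "tr": ["cp1254", "iso8859_9"],
}


def guess_encoding(content: bytes, language: str) -> str:
    """Return the first candidate encoding that decodes content without error."""
    # utf-8 goes first because it is strict: legacy single-byte text with
    # accented characters almost never happens to be valid utf-8.
    candidates = ["utf-8"] + LANGUAGE_ENCODINGS.get(language, [])
    for encoding in candidates:
        try:
            content.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    # latin-1 maps every byte to a character, so it never fails (assumed fallback).
    return "latin-1"


def to_utf8(content: bytes, language: str) -> bytes:
    """Re-encode subtitle bytes as utf-8 using the guessed source encoding."""
    return content.decode(guess_encoding(content, language)).encode("utf-8")


portuguese = "ação".encode("cp1252")
turkish = "şğı".encode("cp1254")
portuguese_guess = guess_encoding(portuguese, "pt")
turkish_guess = guess_encoding(turkish, "tr")
```

A successful decode only shows that the bytes are valid in that encoding, not that the guess is right, which is why the language hint matters: cp1252 and cp1254 both accept the Turkish bytes above but produce different characters.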
You know more about it than I do, just trying to help since we use subliminal in our app xD |
Same problem for me with French subtitles. Details: on my Synology DS214Play (DSM 5.2-5565), some subtitles are written with bad encoding.
Differences (hex):
GOOD ENCODING (direct dl from OpenSubtitles), extract
BAD ENCODING (from Subliminal with OpenSubtitles provider), extract
It is not related to Synology: SynoCommunity/spksrc#1697 (comment)
Is it caused by Python 2.7.3? |
@Cosades what codepage is your syno using? |
@Diaoul A lot of encoding errors in SR were fixed when I updated chardet. I know you only use it as a last resort, but maybe it helps point in the right direction. |
@miigotu: locale: My HMI is in French... Subliminal could be more adaptive, with advanced settings for users who want to tune input/output subtitles. Settings for guesses could be an interesting option, and it could externalize some variables from the code... |
It's not subliminal that makes the assumption based on locale, python does. That's why I was discussing codecs.open with Diaoul before. |
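To make the locale point concrete, here is a small sketch assuming Python 3 (the thread itself mostly concerns Python 2.7, where the details differ). When the built-in open() is used in text mode without an encoding argument, Python falls back to a locale-dependent default, which `locale.getpreferredencoding(False)` reports unless UTF-8 mode is enabled. Passing the encoding explicitly removes that dependency. The file name and variable names are only for illustration.

```python
import locale
import os
import tempfile

# The encoding open() falls back to in text mode when none is given
# (locale-dependent, so it can differ between machines).
default_encoding = locale.getpreferredencoding(False)

path = os.path.join(tempfile.mkdtemp(), "test.srt")

# An explicit encoding makes the result independent of the machine's locale.
with open(path, "w", encoding="utf-8") as handle:
    handle.write("ééé")

with open(path, encoding="utf-8") as handle:
    explicit_text = handle.read()

# Binary mode shows what actually landed on disk.
with open(path, "rb") as handle:
    raw = handle.read()
```

Two machines with different locales can therefore read the same file differently when the encoding is left out, which matches the kind of per-system behaviour reported in this thread.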
@Cosades curious which provider this is, and if it is only either opensubtitles or only the other ones. The other 4 use requests, opensubtitles does not. Might help pinpoint an issue if it is limited to either opensubtitles or non-opensubtitles. @Diaoul podnapisi line 95 (and in other providers): |
Yes, content is bytes and text is a string decoded with the encoding guessed by requests. If you don't care about subtitle file encoding and want to leave the |
Closing in favor of #528 |
Hello,
First of all, thanks for Subliminal. ;)
Some accented characters are wrong in pt-BR subtitles.
I've noticed this encoding problem today for the second time.
Some examples:
I'm not sure if this is a problem with a specific .srt file, or if it happens with all of them.
I manually downloaded one pt-BR subtitle of this movie from OpenSubtitles.org, and the accented characters are right.
I tried the verbose mode, but it doesn't show the full URL of the subtitle being downloaded.
I was planning to check directly against OpenSubtitles.org, to see if there are wrong subtitles among the 10 existing ones.
Here is the verbose output:
This is my locale configuration:
If there is some other information I could provide to make it easier to find the problem, please let me know.
Thanks in advance.