GB18030 file detected as WINDOWS-1252 #23
Comments
Hi, On Wed, Jan 20, 2016 at 3:22 PM, Xijian Yan notifications@github.com
Could you send the file and make a bug report about it? Jehan
|
@Jehan I'm home now. I will send it to you from work tomorrow. Thanks for your concern. :) |
my love gbk.txt |
OK, I had a quick look. It looks like the current implementation only tries to detect GB18030 when escape characters are used, which is not the case in your file. Apparently there are several ways to encode the same characters, sometimes with escape characters and sometimes without (from the little I could understand after a quick look at Wikipedia). I also guess that your file being mostly English does not help, statistically. I'll need to take a closer look at the structure of this encoding later so that I can try to fix this. Thanks for the report! |
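To illustrate the statistical point, here is a minimal Python sketch. It uses only the standard library codecs, not uchardet itself, and the sample text is made up for illustration (it is not the attached file). It shows that a mostly-English GB18030 text contains very few non-ASCII bytes, and that those bytes also happen to be valid WINDOWS-1252, so a detector has little evidence to tell the two apart.

```python
# Sketch only: this does not reproduce uchardet's logic.
# The sample text is hypothetical, chosen to be mostly English.
sample = "This file is almost entirely English, with a little Chinese: 中文. " * 3

raw = sample.encode("gb18030")

# Characters from the GB2312 range are encoded as two bytes, each in
# 0xA1-0xFE. Every byte value in that range is also a defined
# WINDOWS-1252 character, so this decode succeeds (yielding mojibake).
as_cp1252 = raw.decode("cp1252")

# Count how many bytes carry any non-ASCII evidence at all.
high_bytes = sum(1 for b in raw if b >= 0x80)

print(f"{high_bytes} of {len(raw)} bytes are non-ASCII")
print(as_cp1252[:80])
```

With so few high bytes, and all of them legal in both encodings, the guess presumably comes down to character frequency statistics, which a short run of Chinese text cannot tip very far.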
I read some articles. The cause might be the use of GBK or GB18030 characters in a file encoded as GB2312. GB2312, GBK, and GB18030 are all Simplified Chinese encodings; GB18030 is the latest. |
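The relationship between the three encodings can be checked with Python's built-in codecs. This is a rough sketch: the helper `encodable` and the sample characters are my own choices for illustration, not taken from the reported file.

```python
def encodable(text: str, codec: str) -> bool:
    """Return True if `text` can be represented in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


codec_names = ["gb2312", "gbk", "gb18030"]

# Hypothetical sample characters, one per repertoire of interest.
samples = {
    "ASCII": "abc",
    "simplified (in GB2312)": "中",
    "traditional (GBK, not GB2312)": "龍",
    "CJK Extension B (GB18030 only)": "\U00020000",
}

for label, text in samples.items():
    print(label, {name: encodable(text, name) for name in codec_names})

# A character present in all three gets the same two bytes in each,
# which is why the byte stream alone often cannot say which one was meant.
same_bytes = {"中".encode(name) for name in codec_names}
print(same_bytes)
```

Each later encoding accepts everything the earlier one does, so a file containing only ASCII and GB2312-range characters is byte-for-byte identical in all three.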
Moving bug reports to the new project hosting. |
I encountered an issue when using uchardet to detect the encoding of a file that is almost entirely English (with very little Chinese).
The file is GB18030 encoded but detected as WINDOWS-1252.