GB18030 file detected as WINDOWS-1252 #23

Closed
yanxijian opened this issue Jan 20, 2016 · 6 comments

Comments

@yanxijian

I ran into an issue when using uchardet to detect the encoding of a file that is almost entirely English (with a very small amount of Chinese).
The file is GB18030-encoded but is detected as WINDOWS-1252.
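
For readers following along, here is a minimal Python sketch of why a file like this is ambiguous. The sample string is made up for illustration and is not the attached file, and the sketch uses Python's built-in codecs rather than uchardet. It shows that the bytes of a short GB18030 sequence can also be valid WINDOWS-1252 characters, so both decodings succeed.

```python
# Hypothetical sample: mostly English with two Chinese characters.
sample = "my love 中文".encode("gb18030")

# The two Chinese characters become four bytes, all >= 0x80.
high_bytes = bytes(b for b in sample if b >= 0x80)
print(high_bytes.hex())   # d6d0cec4

# Both decodings succeed, so byte validity alone cannot tell them apart.
as_gb18030 = sample.decode("gb18030")
as_cp1252 = sample.decode("cp1252")
print(as_gb18030)         # my love 中文
print(as_cp1252)          # my love ÖÐÎÄ
```

With only a handful of non-ASCII bytes, a detector has very little evidence to prefer one reading over the other.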

@Jehan
Collaborator

Jehan commented Jan 20, 2016

Hi,

On Wed, Jan 20, 2016 at 3:22 PM, Xijian Yan notifications@github.com
wrote:

I encountered an issue when I try to use uchardet to detect a file almost
all English(a very little Chinese)
The file is GB18030 encoded but detected as WINDOWS-1252

Could you send the file and make a bug report about it?
I'll have a look when I can. Thanks!

Jehan



@yanxijian
Author

@Jehan I'm home now. I will send it to you from work tomorrow. Thanks for your concern. :)

@yanxijian
Author

my love gbk.txt
Hi @Jehan, that's the file.

@Jehan
Collaborator

Jehan commented Jan 21, 2016

OK, I had a quick look. It looks like the current implementation only tries to detect GB18030 when escape characters are used, which is not the case in your file. Apparently there are several ways to encode the same characters, sometimes with escape characters and sometimes without (from the little I could understand from a quick look at Wikipedia).

Also, I guess the fact that your file is mostly English does not help, statistically.

I'll need to take a closer look at the structure of this encoding later so that I can try to fix this. Thanks for the report!
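
As background on the structure of the encoding, here is a rough Python sketch of a structural check for GB18030 byte sequences. It is not how uchardet works; the function name `is_structurally_gb18030` is hypothetical, and the check only looks at byte ranges (one-byte ASCII, two-byte, and four-byte forms), not at whether each sequence maps to an assigned character.

```python
def is_structurally_gb18030(data: bytes) -> bool:
    """Return True if data is made only of well-formed GB18030 sequences."""
    i, n = 0, len(data)
    while i < n:
        b1 = data[i]
        if b1 <= 0x7F:                          # one byte: ASCII
            i += 1
            continue
        if not 0x81 <= b1 <= 0xFE or i + 1 >= n:
            return False                        # bad lead byte or truncated
        b2 = data[i + 1]
        if 0x40 <= b2 <= 0xFE and b2 != 0x7F:   # two bytes: 0x40-0x7E, 0x80-0xFE
            i += 2
            continue
        if 0x30 <= b2 <= 0x39:                  # four bytes: digit as 2nd byte
            if i + 3 >= n:
                return False                    # truncated four-byte sequence
            b3, b4 = data[i + 2], data[i + 3]
            if 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39:
                i += 4
                continue
        return False
    return True


print(is_structurally_gb18030("my love 中文".encode("gb18030")))  # True
print(is_structurally_gb18030("café au lait".encode("cp1252")))   # False
```

A check like this can rule GB18030 out for some WINDOWS-1252 text, but it cannot rule it in: many WINDOWS-1252 byte pairs are also well-formed GB18030, which is why statistics are still needed.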

@yanxijian
Author

I read some articles. The cause might be GBK or GB18030 characters used in a file that is otherwise encoded in GB2312. GB2312, GBK, and GB18030 are all Simplified Chinese encodings; GB18030 is the latest.
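
The relationship between the three encodings can be illustrated with a small Python sketch using the built-in codecs. The helper name `encodable_in` and the sample characters are chosen only for illustration.

```python
def encodable_in(ch: str) -> list:
    """Return which of the three codecs can encode the character."""
    result = []
    for codec in ("gb2312", "gbk", "gb18030"):
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            continue
        result.append(codec)
    return result


print(encodable_in("中"))   # ['gb2312', 'gbk', 'gb18030']
print(encodable_in("龍"))   # ['gbk', 'gb18030'] (traditional form, not in GB2312)
print(encodable_in("😀"))   # ['gb18030'] (four-byte form only)

# Characters shared by all three are encoded with identical bytes.
print("中".encode("gb2312") == "中".encode("gb18030"))   # True
```

Because the shared characters use the same bytes, a file that is valid GB2312 is also valid GBK and GB18030, so a detector can only distinguish them when a character outside the smaller set appears.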

@Jehan
Collaborator

Jehan commented Aug 28, 2016

Moving bug reports to the new project hosting.
Work will happen there: https://bugs.freedesktop.org/show_bug.cgi?id=97521

@Jehan Jehan closed this as completed Aug 28, 2016