GB18030 file detected as WINDOWS-1252 #23

yanxijian · 2016-01-20T14:22:16Z

I ran into an issue when using uchardet to detect the encoding of a file that is almost entirely English (with only a little Chinese).
The file is GB18030-encoded but is detected as WINDOWS-1252.

Jehan · 2016-01-20T14:25:08Z

Hi,

On Wed, Jan 20, 2016 at 3:22 PM, Xijian Yan wrote:

> I encountered an issue when I try to use uchardet to detect a file almost all English(a very little Chinese)
> The file is GB18030 encoded but detected as WINDOWS-1252

Could you send the file and make a bug report about it?
I'll have a look when I can. Thanks!

Jehan


yanxijian · 2016-01-20T14:30:59Z

@Jehan I'm at home now. I will send it to you from work tomorrow. Thanks for your concern. :)

yanxijian · 2016-01-21T01:42:37Z

my love gbk.txt
Hi @Jehan, that's the file.

Jehan · 2016-01-21T18:09:50Z

OK, I had a quick look. It looks like the current implementation only tries to detect GB18030 when the escape characters are used, which is not the case in your file. Apparently there are several ways to encode the same characters, sometimes with escape characters and sometimes without (from the little I could understand from a quick look at Wikipedia).

Also, I guess the fact that your file is mostly English does not help, statistically.

I'll need to take a closer look at the structure of this encoding later so that I can try to fix this. Thanks for the report!
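
To illustrate why a mostly English file is statistically hard to classify, here is a minimal Python sketch. It does not use uchardet and does not reproduce its logic; the sample string is made up and is not the attached file. It only shows that a few GB18030 two-byte sequences can also be valid WINDOWS-1252 text, and that they make up a small share of the bytes.

```python
# Illustration only: a made-up sample, not the reporter's attachment.
text = "I love you. 我爱你."

# Encode as GB18030; the ASCII part is unchanged, each Chinese
# character here becomes two bytes in the 0x81-0xFE range.
raw = text.encode("gb18030")

# Those bytes are all defined in Windows-1252 too, so decoding does
# not fail; it silently yields accented Latin characters instead.
as_cp1252 = raw.decode("cp1252")

# Count how much non-ASCII evidence a detector would have to work with.
high_bytes = sum(1 for b in raw if b >= 0x80)

print(raw)
print(f"{high_bytes} of {len(raw)} bytes are non-ASCII")
print("decodes as cp1252 without error:", as_cp1252 != text)
```

With so few high bytes, any frequency-based guess has very little signal, and a single-byte Western encoding remains a plausible reading of the same data.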

yanxijian · 2016-01-25T02:57:06Z

I read some articles. The cause might be GBK or GB18030 characters being used in a file that is otherwise encoded as GB2312. GB2312, GBK, and GB18030 are all Simplified Chinese encodings; GB18030 is the latest.
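
The relationship between the three encodings can be checked with Python's built-in codecs. This is only a sketch: the three sample characters are my own choice, and `encodable` is a hypothetical helper, not part of uchardet. It shows a character present in all three encodings, one that GB2312 lacks but GBK has, and one that only GB18030 can represent (as a four-byte sequence).

```python
def encodable(ch: str, codec: str) -> bool:
    """Return True if the character can be encoded with the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Sample characters chosen for illustration (an assumption, not from the file):
#   U+6211  common character, present in GB2312
#   U+9555  not in GB2312, added in GBK
#   U+20000 outside the BMP, only reachable via GB18030 four-byte codes
samples = ["\u6211", "\u9555", "\U00020000"]

for ch in samples:
    support = {c: encodable(ch, c) for c in ("gb2312", "gbk", "gb18030")}
    size = len(ch.encode("gb18030"))
    print(f"U+{ord(ch):04X} {support} -> {size} bytes in GB18030")
```

For characters that GB2312 already covers, all three codecs produce the same bytes, so a file only reveals itself as GBK or GB18030 when it contains characters from the extended ranges.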

Jehan · 2016-08-28T22:21:32Z

Moving bug reports to the new project hosting.
Work will happen there: https://bugs.freedesktop.org/show_bug.cgi?id=97521

Jehan closed this as completed Aug 28, 2016
