Possibility to support Chinese codecs? #34

Closed
mjpieters opened this Issue Mar 16, 2015 · 4 comments

mjpieters commented Mar 16, 2015

Based on this Stack Overflow question, I looked into support for Chinese character encodings.

The GB* codecs are, like UTF-8, variable-width encodings. The example in the question reads:

袋è¢âdcx€¹Ã¤Â¸Å½Ã¦Å“‹å‹们çâ€ÂµÃ¥Â­Âå•â€

which can be decoded using GB* encodings to varying degrees of success:

>>> print text.encode('windows-1252').decode('gb2312', 'replace')
猫垄�姑�⑩dcx�盲赂沤忙��姑ヂ�姑ぢ宦�р�得ヂ�氓�⑩�
>>> print text.encode('windows-1252').decode('gbk', 'replace')
猫垄鈥姑�⑩dcx�盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦�р�得ヂ�氓鈥⑩�
>>> print text.encode('windows-1252').decode('gb18030', 'replace')
猫垄鈥姑⑩dcx�盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦р�得ヂ氓鈥⑩�
>>> print text.encode('windows-1252').decode('big5', 'replace')
癡瞽�嘔阬Tdcx�瓣繡鬚疆��嘔氐�嘔刈鄞珍把�腕氐倦疇�T�

Unfortunately I do not know which of these is closest to the original, but that doesn't matter all that much. What would be needed is an analysis of how GB* encodings pushed through the CP1252 / Latin-1 sieve can be distinguished from UTF-8 mojibake and handled in fix_one_step_and_explain().

Is supporting these codecs feasible?
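For illustration, here is a minimal sketch (Python 3; CANDIDATES and rank_decodings are my own names, not part of ftfy) of the kind of brute-force comparison I have in mind: re-encode the mojibake as cp1252 and score each candidate codec by how many U+FFFD replacement characters survive.

# Sketch only: fewer replacement characters is weak evidence that a
# codec is the right one, not proof.
CANDIDATES = ['utf-8', 'gb2312', 'gbk', 'gb18030', 'big5']

def rank_decodings(text):
    raw = text.encode('windows-1252', 'replace')
    scored = []
    for codec in CANDIDATES:
        decoded = raw.decode(codec, 'replace')
        scored.append((decoded.count('\ufffd'), codec, decoded))
    return sorted(scored)

for count, codec, decoded in rank_decodings('袋è¢âdcx€¹Ã¤Â¸Å½'):
    print(count, codec, decoded)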

rspeer commented Mar 16, 2015

I don't think it would be possible to support GB* without a huge loss in accuracy. The problem is that most sequences of bytes can be decoded as GB18030, for example, regardless of whether they're actually intended to be GB18030. (Are you sure the text in that example is meant to be Chinese at all?)

The thing that makes ftfy possible is that most sequences of bytes aren't valid UTF-8, so when you can decode something as UTF-8, it's a strong signal that it's the right thing to do.

At one point I looked into trying to support the Japanese encoding Shift-JIS. Even though it has fewer valid sequences than the GB* encodings, I was getting too many false positives on likely sequences of bytes.
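A quick way to see this (illustrative Python 3, not ftfy code): feed random bytes to both decoders and count how many samples each one accepts.

import os

# Random bytes are usually structurally valid GB18030, but almost
# never valid UTF-8.
def decodes_ok(data, codec):
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False

samples = [os.urandom(16) for _ in range(1000)]
print(sum(decodes_ok(s, 'gb18030') for s in samples))  # often 100+
print(sum(decodes_ok(s, 'utf-8') for s in samples))    # almost always 0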

rspeer closed this Mar 16, 2015

mjpieters commented Mar 16, 2015

Sure; I admittedly did not look at what kind of byte sequences the GB* codecs produce. If you say it's not feasible due to the false-positive rate, then it's not an option.

Yes, the text in question was meant to be Chinese; the problem was explicitly constrained to text that was either English or Chinese only.

rspeer commented Mar 16, 2015

This string is an interesting puzzle. I'm coming to the conclusion that it's not actually GB*; it seems to be Chinese in triple-UTF-8 with some bytes missing.
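For instance (my own reconstruction, using plain windows-1252, which happens to cover every byte involved), the fragment Ã¤Â¸Å½ in the sample drops out of 与 (U+4E0E) after two visible rounds of UTF-8-encode / cp1252-decode, the third UTF-8 layer being the encoding the page itself was served in:

# Reconstruction sketch, not ftfy output.
s = '与'  # U+4E0E
for _ in range(2):
    s = s.encode('utf-8').decode('windows-1252')
print(s)  # Ã¤Â¸Å½ -- matches a fragment of the garbled string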

mjpieters commented Mar 16, 2015

See, to me any Chinese character looks like any other Chinese character, and I made the incorrect assumption that by using GB* on the sloppy-windows-1252 result I'd get something approaching valid text.

The missing bytes are probably due to unprintable characters not having been copied into the question; the OP didn't use repr() here, after all.

Something like this then?

>>> print u'袋è¢âdcx€¹Ã¤Â¸Å½Ã¦Å“‹å‹们çâ€ÂµÃ¥Â­Âå•â€'.encode('sloppy-windows-1252').decode('utf8', 'ignore').encode('sloppy-windows-1252').decode('utf8', 'ignore')
袋dcx与朋们
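One note for anyone replaying this snippet: sloppy-windows-1252 is not a standard-library codec; it gets registered as a side effect of importing ftfy's bad_codecs module.

>>> import ftfy.bad_codecs  # registers 'sloppy-windows-1252' and friends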