Possibility to support Chinese codecs? #34
The GB* series of codecs are, like UTF-8, variable-width encodings. The example in the question reads:
which can be decoded using GB* encodings to varying degrees of success:
>>> print text.encode('windows-1252').decode('gb2312', 'replace')
猫垄�姑�⑩dcx�盲赂沤忙��姑ヂ�姑ぢ宦�р�得ヂ�氓�⑩�
>>> print text.encode('windows-1252').decode('gbk', 'replace')
猫垄鈥姑�⑩dcx�盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦�р�得ヂ�氓鈥⑩�
>>> print text.encode('windows-1252').decode('gb18030', 'replace')
猫垄鈥姑⑩dcx�盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦р�得ヂ氓鈥⑩�
>>> print text.encode('windows-1252').decode('big5', 'replace')
癡瞽�嘔阬Ｔdcx�瓣繡鬚疆��嘔氐�嘔刈鄞珍把�腕氐倦疇�Ｔ�
Unfortunately I do not know which one of these is closest to the original, but that doesn't matter all that much. What would be needed is an analysis of how GB* encodings pushed through the CP1252 / Latin-1 sieve can be distinguished from UTF-8 mojibake, and how they could then be handled.
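As a minimal sketch of that sieve (my own illustration with an arbitrary Chinese string, not the OP's data): GB2312-encoded text mis-decoded as windows-1252 produces exactly this flavor of mojibake, and the round trip recovers it when you know the right codec pair:

```python
# Illustration (not from the issue): push GB2312-encoded Chinese text
# through the windows-1252 "sieve", then reverse the damage.
original = '你好'  # arbitrary example text, not the OP's string
gb_bytes = original.encode('gb2312')        # b'\xc4\xe3\xba\xc3'
mojibake = gb_bytes.decode('windows-1252')  # every byte happens to map
restored = mojibake.encode('windows-1252').decode('gb2312')
print(mojibake, restored)  # ÄãºÃ 你好
```

The hard part, as discussed below, is deciding *which* codec pair to apply without already knowing the answer.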
Is supporting these codecs feasible?
I don't think it would be possible to support GB* without a huge loss in accuracy. The problem is that most sequences of bytes can be decoded as GB18030, for example, regardless of whether they're actually intended to be GB18030. (Are you sure the text in that example is meant to be Chinese at all?)
The thing that makes ftfy possible is that most sequences of bytes aren't valid UTF-8, so when you can decode something as UTF-8, it's a strong signal that it's the right thing to do.
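This asymmetry can be demonstrated with a quick experiment (my own sketch, not ftfy's detection code): random byte strings are almost never valid UTF-8 but very often valid GB18030.

```python
# Illustration of the asymmetry: count how often random 8-byte strings
# decode cleanly as UTF-8 versus GB18030. Not ftfy's actual heuristic.
import random

random.seed(0)  # deterministic sample for reproducibility
utf8_ok = gb18030_ok = 0
for _ in range(1000):
    chunk = bytes(random.randrange(256) for _ in range(8))
    try:
        chunk.decode('utf-8')
        utf8_ok += 1
    except UnicodeDecodeError:
        pass
    try:
        chunk.decode('gb18030')
        gb18030_ok += 1
    except UnicodeDecodeError:
        pass

print(utf8_ok, gb18030_ok)  # GB18030 accepts far more byte soup
```

So a successful UTF-8 decode carries real evidence; a successful GB18030 decode carries almost none.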
At one point I looked into trying to support the Japanese encoding Shift-JIS. Even though it has fewer valid sequences than the GB* encodings, I was getting too many false positives on likely sequences of bytes.
Fair enough; I did not look at what kinds of byte sequences the GB* codecs produce. If you say it's not feasible due to the false-positive rate, then it's not an option.
Yes, the text in question was meant to be Chinese; the problem was explicitly constrained to text that was either English or Chinese only.
To me, any Chinese character looks like any other Chinese character, and I made the incorrect assumption that by using GB* on the sloppy-windows-1252 result I'd get something approaching valid text.
The missing bytes are probably due to unprintable bytes not having been copied into the question; the OP didn't use
Something like this then?
>>> print u'Ã¨Â¢â€¹Ã¨Â¢âdcx€¹Ã¤Â¸Å½Ã¦Å“â€¹Ã¥Ââ€¹Ã¤Â»Â¬Ã§â€ÂµÃ¥ÂÂÃ¥â€¢â€'.encode('sloppy-windows-1252').decode('utf8', 'ignore').encode('sloppy-windows-1252').decode('utf8', 'ignore')
袋dcx与朋们
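The same double-decode idea can be shown with the stock windows-1252 codec on a toy string it round-trips completely (my own example, not the OP's data); the OP's string needed two passes because it was mojibake'd twice, whereas a single layer takes one pass:

```python
# Toy example of the repair step above, using the standard
# windows-1252 codec: 'été' written out as UTF-8 but read back
# as windows-1252 yields 'Ã©tÃ©'; one encode/decode pass fixes it.
mojibake = 'Ã©tÃ©'
fixed = mojibake.encode('windows-1252').decode('utf-8')
print(fixed)  # été
```

ftfy's sloppy-windows-1252 codec exists precisely for the bytes (like 0x81 or 0x90) that the strict codec refuses to round-trip.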