[BUG] Wrong encoding detected for ascii string #62

grindsa · 2021-07-15T05:50:19Z

Describe the bug:

It looks like charset_normalizer incorrectly detects the ASCII string below as utf_16_le, while chardet detects it as ascii.

To Reproduce:

>>> rawdata = b'g4UsPJdfzNkGW2jwmKDGDilKGKYtpF2X.mx3MaTWL1tL7CNn5U7DeCcodKX7S3lwwJPKNjBT8etY'

>>> import charset_normalizer
>>> detected_cn = charset_normalizer.detect(rawdata)
>>> detected_cn
{'encoding': 'utf_16_le', 'language': '', 'confidence': 1.0}

>>> import chardet
>>> detected_cd = chardet.detect(rawdata)
>>> print(detected_cd)
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
>>>

Expected behavior:

String should be detected as ascii

Desktop:

OS: Linux, Windows
Python version: 3.8.2


sakibstark11 · 2021-07-15T06:19:38Z

Also, when trying to import it, it throws an encoding error on the documentation:

SyntaxError: Non-ASCII character '\xd1' in file /home/xyz/xyz-service/charset_normalizer/__init__.py on line 12, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

Ousret · 2021-07-15T06:41:48Z

Hi @grindsa

Thanks for your report. This comes from the mess detection: given how Charset-Normalizer is designed, this result is expected for those particular bytes.
I will try to improve ASCII support in another PR.

2021-07-15 08:39:36,420 | WARNING | cp_isolation is set. use this flag for debugging purpose. limited list of encoding allowed : ascii, utf_7, utf_16_le.
2021-07-15 08:39:36,420 | WARNING | override steps (5) and chunk_size (512) as content does not fit (76 byte(s) given) parameters.
2021-07-15 08:39:36,422 | WARNING | ascii was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 27.600000 %.
2021-07-15 08:39:36,422 | INFO | Code page utf_16_le is a multi byte encoding table and it appear that at least one character was encoded using n-bytes. Should not be a coincidence. Priority +1 given.
2021-07-15 08:39:36,423 | INFO | utf_16_le passed initial chaos probing. Mean measured chaos is 0.000000 %
2021-07-15 08:39:36,424 | WARNING | utf_7 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 27.600000 %.
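To illustrate why these bytes are ambiguous at the decoding level, here is a minimal standard-library sketch (it does not call Charset-Normalizer). It only shows that the 76-byte payload from the report decodes without error under both ascii and utf_16_le, so the choice between them has to come from a heuristic such as the chaos measurement in the log above.

```python
# The payload from the report: 76 bytes, all of them below 0x80.
rawdata = b'g4UsPJdfzNkGW2jwmKDGDilKGKYtpF2X.mx3MaTWL1tL7CNn5U7DeCcodKX7S3lwwJPKNjBT8etY'

# Both decodes succeed: the byte count is even and no pair of bytes
# forms a surrogate code unit, so utf_16_le accepts the payload too.
as_ascii = rawdata.decode("ascii")
as_utf16 = rawdata.decode("utf_16_le")

# Every two input bytes collapse into one UTF-16 code unit.
print(len(rawdata), len(as_ascii), len(as_utf16))  # 76 76 38

# None of the utf_16_le characters fall in the ASCII range.
print(all(ord(ch) > 0x7F for ch in as_utf16))
```

Decoding success alone therefore cannot rule out utf_16_le for an even-length, printable-ASCII payload like this one.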

My preliminary thought is to bypass the mess detection (MD) for ASCII detection, at least partially.
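As a rough illustration of that idea, the sketch below short-circuits to ascii when every byte is below 0x80 and only otherwise hands the payload to a detector. It is an assumption-laden sketch, not the change made in the project: `is_pure_ascii`, `detect_with_ascii_shortcut` and the `fallback` callable are hypothetical names, and the returned dict simply mirrors the shape shown in the report.

```python
from typing import Any, Callable, Dict


def is_pure_ascii(payload: bytes) -> bool:
    """Return True when every byte is in the 7-bit ASCII range."""
    return all(byte < 0x80 for byte in payload)


def detect_with_ascii_shortcut(
    payload: bytes,
    fallback: Callable[[bytes], Dict[str, Any]],
) -> Dict[str, Any]:
    """Hypothetical wrapper: skip heuristic detection for pure-ASCII input."""
    if payload and is_pure_ascii(payload):
        # Result shape copied from the detect() output in the report.
        return {"encoding": "ascii", "language": "", "confidence": 1.0}
    # Anything else (empty input, or any byte >= 0x80) goes to the real detector.
    return fallback(payload)
```

A caller could pass `charset_normalizer.detect` or `chardet.detect` as `fallback`. The trade-off is that a genuine UTF-16 payload made only of bytes below 0x80 would also be reported as ascii.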

Hi @sakibstark11

I am sorry, but it seems that you are running an unsupported Python version. Either upgrade your Python or uninstall Charset-Normalizer.
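For context, the "Non-ASCII character ... but no encoding declared" SyntaxError is what Python 2 reports when a source file contains non-ASCII bytes without a PEP 263 declaration; Python 3 reads source files as UTF-8 by default. A quick way to check which interpreter is in use is sketched below (`defaults_to_utf8_source` is a hypothetical helper name, not part of any library).

```python
import sys


def defaults_to_utf8_source(version_info=sys.version_info):
    """True for Python 3+, where source files default to UTF-8."""
    return version_info[0] >= 3


# Print the running interpreter version and whether it defaults to UTF-8 source.
print(sys.version.split()[0], defaults_to_utf8_source())
```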

grindsa added the bug (Something isn't working) and help wanted (Extra attention is needed) labels Jul 15, 2021

Ousret mentioned this issue Jul 15, 2021

❇️ Improvement over ASCII detection, adjust the sensibility of 'ArchaicUpperLowerPlugin' #63

Merged

Ousret closed this as completed in #63 Jul 15, 2021

Ousret mentioned this issue Jul 15, 2021

v2.0.3 #65

Merged
