[BUG] Wrong encoding detected for ascii string #62

Closed
grindsa opened this issue Jul 15, 2021 · 2 comments · Fixed by #63 or #65
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@grindsa

grindsa commented Jul 15, 2021

Describe the bug:

charset_normalizer incorrectly detects the ASCII string below as utf_16_le, while chardet detects it as ascii.

To Reproduce:

>>> rawdata = b'g4UsPJdfzNkGW2jwmKDGDilKGKYtpF2X.mx3MaTWL1tL7CNn5U7DeCcodKX7S3lwwJPKNjBT8etY'

>>> import charset_normalizer
>>> detected_cn = charset_normalizer.detect(rawdata)
>>> detected_cn
{'encoding': 'utf_16_le', 'language': '', 'confidence': 1.0}

>>> import chardet
>>> detected_cd = chardet.detect(rawdata)
>>> print(detected_cd)
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
>>>

Expected behavior:

The string should be detected as ascii.

Desktop (please complete the following information):

OS: Linux, Windows
Python version: 3.8.2
grindsa added the bug and help wanted labels on Jul 15, 2021
@sakibstark11

Also, importing the package raises an encoding error, apparently caused by the documentation text in __init__.py:

SyntaxError: Non-ASCII character '\xd1' in file /home/xyz/xyz-service/charset_normalizer/__init__.py on line 12, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

@Ousret
Collaborator

Ousret commented Jul 15, 2021

Hi @grindsa

Thanks for your report. This comes from the mess detection (MD): given how Charset-Normalizer is designed, this result is expected for those particular bytes.
I will try to improve ASCII support with another PR.

2021-07-15 08:39:36,420 | WARNING | cp_isolation is set. use this flag for debugging purpose. limited list of encoding allowed : ascii, utf_7, utf_16_le.
2021-07-15 08:39:36,420 | WARNING | override steps (5) and chunk_size (512) as content does not fit (76 byte(s) given) parameters.
2021-07-15 08:39:36,422 | WARNING | ascii was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 27.600000 %.
2021-07-15 08:39:36,422 | INFO | Code page utf_16_le is a multi byte encoding table and it appear that at least one character was encoded using n-bytes. Should not be a coincidence. Priority +1 given.
2021-07-15 08:39:36,423 | INFO | utf_16_le passed initial chaos probing. Mean measured chaos is 0.000000 %
2021-07-15 08:39:36,424 | WARNING | utf_7 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 27.600000 %.
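
For context on why utf_16_le is even a candidate here, the small check below shows that the reported bytes decode without error under both codecs. It uses only Python's built-in codecs and says nothing about how the chaos percentages in the log are computed; it only illustrates that a strict decode alone cannot separate the two readings.

```python
# The 76 bytes from the report above.
raw = b'g4UsPJdfzNkGW2jwmKDGDilKGKYtpF2X.mx3MaTWL1tL7CNn5U7DeCcodKX7S3lwwJPKNjBT8etY'

# Strict ASCII decoding succeeds: every byte is below 0x80.
as_ascii = raw.decode('ascii')

# Strict UTF-16-LE decoding also succeeds: the length is even, and since
# every high byte is below 0x80, no code unit falls in the surrogate range.
as_utf16 = raw.decode('utf_16_le')

print(len(raw), len(as_ascii), len(as_utf16))  # 76 76 38
```

Any even-length run of ASCII bytes behaves this way, so the choice between the two has to come from a plausibility measure rather than from decode errors.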

My preliminary thought is to bypass the mess detection (MD) for ASCII detection, at least partially.

Hi @sakibstark11

I am sorry, but it seems that you are running an unsupported Python version. Either upgrade your Python or uninstall Charset-Normalizer.
