
[BUG] UnicodeDecodeError: 'ascii' codec can't decode byte when using from_path #136

Closed
aytey opened this issue Nov 9, 2021 · 2 comments · Fixed by #137
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

Comments


aytey commented Nov 9, 2021

Describe the bug

I have a file such as this:

(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ file temp.txt
temp.txt: C source, ASCII text
(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ du -hs temp.txt
9.6M    temp.txt
(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ wc temp.txt
  188585  1001674 10000082 temp.txt

and I'm trying to parse it with:

#!/usr/bin/env python3

from charset_normalizer import from_path

file = "temp.txt"

lines = [
    line.strip() for line in str(from_path(file).best()).split("\n")
]

using this version of charset_normalizer:

(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ echo -e "import charset_normalizer\nprint(charset_normalizer.version.VERSION)" | python
['2', '0', '7']

On the main file, I get this exception:

(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ ./test.py
Traceback (most recent call last):
  File "./test.py", line 8, in <module>
    line.strip() for line in str(from_path(file).best()).split("\n")
  File "/home/avj/clones/compile_commands_processor/venv/lib64/python3.8/site-packages/charset_normalizer/models.py", line 114, in __str__
    self._string = str(self._payload, self._encoding, "strict")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 813828: ordinal not in range(128)
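The decode failure itself can be illustrated in isolation (a minimal stdlib sketch, not the library's actual code path; the payload is made up, with the byte 0xb5 taken from the traceback above):

```python
# Byte 0xb5 (the micro sign in latin-1) is not valid ASCII, so a strict
# decode raises exactly the exception seen in the traceback.
payload = b"delay of 5 \xb5s\n"

try:
    payload.decode("ascii", "strict")
except UnicodeDecodeError as exc:
    print(exc.reason)  # "ordinal not in range(128)"

# A tolerant decode succeeds by substituting U+FFFD for the bad byte.
print(payload.decode("ascii", "replace"))
```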

However, it seems that something "weird" goes on at around the 10000082 character mark:

This crashes (file size: 10000082 chars):

(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ head -n 188602 good.txt | tail -n 188585 > temp.txt && wc -c temp.txt && /usr/bin/time -vvv timeout -k 5 -s 9 5 ./test.py
10000082 temp.txt
Traceback (most recent call last):
  File "./test.py", line 8, in <module>
    line.strip() for line in str(from_path(file).best()).split("\n")
  File "/home/avj/clones/compile_commands_processor/venv/lib64/python3.8/site-packages/charset_normalizer/models.py", line 114, in __str__
    self._string = str(self._payload, self._encoding, "strict")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 813522: ordinal not in range(128)
Command exited with non-zero status 1
        Command being timed: "timeout -k 5 -s 9 5 ./test.py"
        User time (seconds): 0.04
        System time (seconds): 0.03
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.07
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 45904
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 5116
        Voluntary context switches: 3
        Involuntary context switches: 3
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 1

whereas this does not finish within 5 seconds (maybe that's reasonable for a ~10 MiB file) (file size: 9999820 chars):

(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ head -n 188602 good.txt | tail -n 188584 > temp.txt && wc -c temp.txt && /usr/bin/time -vvv timeout -k 5 -s 9 5 ./test.py
9999820 temp.txt
Command terminated by signal 9
        Command being timed: "timeout -k 5 -s 9 5 ./test.py"
        User time (seconds): 0.00
        System time (seconds): 0.00
        Percent of CPU this job got: 0%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.00
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 10304
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 806
        Voluntary context switches: 2
        Involuntary context switches: 1
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Now, it would be reasonable to ask "okay, but what happens in the one line you've removed?", so we take slightly more head and leave tail alone (file size: 9999847 chars):

(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ head -n 188603 good.txt | tail -n 188585 > temp.txt && wc -c temp.txt && /usr/bin/time -vvv timeout -k 5 -s 9 5 ./test.py
9999847 temp.txt
Command terminated by signal 9
        Command being timed: "timeout -k 5 -s 9 5 ./test.py"
        User time (seconds): 0.00
        System time (seconds): 0.00
        Percent of CPU this job got: 0%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.00
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 10148
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 772
        Voluntary context switches: 2
        Involuntary context switches: 0
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

To Reproduce

Unfortunately, I am not able to share this file immediately -- I tried to use cvise and halfempty on it to find the smallest reproducing file, but hit a roadblock at around the 10000082 character mark.

Expected behavior

I believe that charset_normalizer shouldn't crash with UnicodeDecodeError: 'ascii' codec can't decode byte when using from_path.

Desktop (please complete the following information):

  • OS: Linux
  • Python version: 3.8.12
  • Package version: 2.0.7
Ousret (Owner) commented Nov 9, 2021

Hi,

Thanks for the detailed report. Indeed, your observation is confirmed.
I have put together a small reproducible example that will serve to verify and fix the issue.

Feel free to verify against the branch bugfix-lazy-str-decode-error.
This bug concerns only files or byte sequences larger than 1 MB.
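The shape of such a reproducer can be sketched with the standard library alone (hypothetical: the line template and the trailing comment are invented to mimic the report, a mostly-ASCII C-like file of roughly 10 MB with a single stray latin-1 byte near the end):

```python
import os
import tempfile

# Build a ~10 MB mostly-ASCII file whose only non-ASCII byte (0xb5)
# sits deep inside, mimicking the file from the report.
line_template = b"int foo_%08d = 0;\n"

with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as fh:
    written = 0
    i = 0
    while written < 10_000_000:
        chunk = line_template % i
        fh.write(chunk)
        written += len(chunk)
        i += 1
    fh.write(b"/* wait 5 \xb5s before retry */\n")  # the single stray byte
    path = fh.name

print(path, os.path.getsize(path))
```

Pointing the test.py script from the report at the generated path should exercise the same code path.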

aytey (Author) commented Nov 10, 2021

I tried it -- works! 🎉

I'll close this issue given the PR is up.

@aytey aytey closed this as completed Nov 10, 2021
Ousret added a commit that referenced this issue Nov 20, 2021
* ✔️ Add simple test case that show the problem (Issue #136)

* 🐛 Fix getting misled by large sequences (lazy str loading)

* 🐛 Ignore too insignificant extracted chunk

* 🔖 Bump to 2.0.8.dev3